Overview
========

Chunky exposes a modular pipeline for converting heterogeneous project artefacts into well-behaved text chunks. The pipeline is language-aware, pluggable, and ready for Nancy Brain's MCP-backed retrieval workflows.

.. note::
   See ``design/CHUNKY_V2_SPEC.md`` and ``design/CHUNK_MERGE_SPEC.md`` for implemented v2/v2.1 behavior.
   ``design/SEMANTIC_CHUNKER.md`` is retained as an archival early design draft.

Getting Started
---------------

Install the package from PyPI or from source:

.. code-block:: bash

   pip install chunky-files

.. code-block:: bash
   :caption: from source

   git clone https://github.com/AmberLee2427/chunky.git
   cd chunky
   pip install .

For development work and documentation builds:

.. code-block:: bash

   pip install -e ".[dev,docs]"

First chunks via the pipeline:

.. code-block:: python

   from pathlib import Path

   from chunky import ChunkPipeline, ChunkerConfig

   pipeline = ChunkPipeline()
   config = ChunkerConfig(
       max_chars=1000,
       min_chunk_chars=80,  # forward-merge tiny chunks into successor chunks
       lines_per_chunk=40,
       line_overlap=5,
   )
   chunks = pipeline.chunk_file(Path("/path/to/file.py"), config=config)

   for chunk in chunks:
       print(chunk.chunk_id, chunk.metadata["line_start"], chunk.metadata["line_end"])

Built-in chunkers
-----------------

* ``PythonSemanticChunker``: splits modules on top-level functions/classes and captures remaining context.
* ``MarkdownHeadingChunker``: groups content per heading while keeping introductory prose.
* ``JSONYamlChunker``: slices structured configs by their first-level keys/items and falls back if parsing fails.
* ``PlainTextChunker``: groups blank-line separated paragraphs before falling back to sliding windows.
* ``FortranChunker``: captures ``program``, ``subroutine``, and ``function`` blocks with minimal heuristics.
* ``RSTChunker``: detects reStructuredText heading sections and chunks by section boundaries.
* ``NotebookChunker``: groups nb4llm notebook exports (``.nb.txt``) into markdown+code context chunks.
* Tree-sitter chunkers for C/C++/HTML/Bash, available when the optional ``tree`` extra is installed, with gap-filling so uncaptured lines are still emitted.
* ``SlidingWindowChunker``: deterministic line windows with configurable overlap.

Chunk identifiers default to ``#chunk-0000``. Provide ``Document.metadata['doc_id']`` (or set ``ChunkerConfig.doc_id_key``) and adjust the suffix with ``ChunkerConfig.chunk_id_template`` to suit your downstream needs.

Forward-merge behavior
----------------------

Set ``ChunkerConfig.min_chunk_chars`` to a positive integer to merge tiny chunks into their successor chunk (or into the predecessor for trailing tiny chunks). This keeps small but meaningful context (imports, decorators, short comments/docstrings) attached to nearby semantic content.

Roadmap
-------

* Phase 1: infrastructure scaffolding and sliding-window baseline.
* Phase 2: language-specific chunkers (Python, Markdown, JSON/YAML, notebooks, RST).
* Phase 3: semantic/embedding-driven chunking.
* Phase 4: documentation, benchmarks, and Nancy Brain integration.
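
As a rough illustration of the forward-merge behavior described earlier, the sketch below merges tiny chunks into their successor and attaches a trailing tiny chunk to its predecessor. It is not Chunky's implementation: ``forward_merge`` is a hypothetical helper that works on plain strings, and treating a chunk as "tiny" when its accumulated text is shorter than ``min_chars`` is an assumption made for the example.

```python
from typing import List


def forward_merge(chunks: List[str], min_chars: int) -> List[str]:
    """Hypothetical sketch: fold chunks shorter than min_chars into a neighbour."""
    merged: List[str] = []
    pending = ""  # tiny text waiting to be prepended to its successor

    for text in chunks:
        candidate = pending + text
        if len(candidate) < min_chars:
            # Still too small on its own: carry it forward to the next chunk.
            pending = candidate
        else:
            merged.append(candidate)
            pending = ""

    if pending:
        # A trailing tiny chunk has no successor, so attach it to the predecessor.
        if merged:
            merged[-1] += pending
        else:
            merged.append(pending)

    return merged


if __name__ == "__main__":
    parts = ["import os\n", "def f():\n    return os.getcwd()\n", "# end\n"]
    for chunk in forward_merge(parts, min_chars=20):
        print(repr(chunk))
```

With ``min_chars=0`` nothing is merged, which mirrors the idea that merging only applies when the threshold is a positive integer.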