Overview¶

Chunky exposes a modular pipeline for converting heterogeneous project artefacts into well-behaved text chunks. The pipeline is language-aware, pluggable, and ready for Nancy Brain’s MCP-backed retrieval workflows.

Note

See design/CHUNKY_V2_SPEC.md and design/CHUNK_MERGE_SPEC.md for implemented v2/v2.1 behavior. design/SEMANTIC_CHUNKER.md is retained as an archival early design draft.

Getting Started¶

Install the package from PyPI or from source:

pip install chunky-files

from source¶

git clone https://github.com/AmberLee2427/chunky.git
cd chunky
pip install .

For development work and documentation builds:

pip install -e ".[dev,docs]"

First chunks via the pipeline:

from pathlib import Path

from chunky import ChunkPipeline, ChunkerConfig

pipeline = ChunkPipeline()
config = ChunkerConfig(
    max_chars=1000,
    min_chunk_chars=80,  # forward-merge tiny chunks into successor chunks
    lines_per_chunk=40,
    line_overlap=5,
)
chunks = pipeline.chunk_file(Path("/path/to/file.py"), config=config)

for chunk in chunks:
    print(chunk.chunk_id, chunk.metadata["line_start"], chunk.metadata["line_end"])

Built-in chunkers¶

PythonSemanticChunker — splits modules on top-level functions/classes and captures remaining context.
MarkdownHeadingChunker — groups content per heading while keeping introductory prose.
JSONYamlChunker — slices structured configs by their first-level keys/items and falls back if parsing fails.
PlainTextChunker — groups blank-line separated paragraphs before falling back to sliding windows.
FortranChunker — captures program, subroutine, and function blocks with minimal heuristics.
RSTChunker — detects reStructuredText heading sections and chunks by section boundaries.
NotebookChunker — groups nb4llm notebook exports (.nb.txt) into markdown+code context chunks.
Tree-sitter chunkers (optional extra) for C/C++/HTML/Bash when the tree extra is installed, with gap-filling so uncaptured lines are still emitted.
SlidingWindowChunker — deterministic line windows with configurable overlap.

Chunk identifiers default to <doc_id>#chunk-0000. Provide Document.metadata['doc_id'] (or set ChunkerConfig.doc_id_key) and adjust the suffix with ChunkerConfig.chunk_id_template to suit your downstream needs.

Forward-merge behavior¶

Set ChunkerConfig.min_chunk_chars to a positive integer to merge tiny chunks into their successor chunk (or into the predecessor for trailing tiny chunks). This keeps small but meaningful context (imports, decorators, short comments/docstrings) attached to nearby semantic content.

Roadmap¶

Phase 1: infrastructure scaffolding and sliding-window baseline.
Phase 2: language-specific chunkers (Python, Markdown, JSON/YAML, notebooks, RST).
Phase 3: semantic/embedding-driven chunking.
Phase 4: documentation, benchmarks, and Nancy Brain integration.