Overview¶
Chunky exposes a modular pipeline for converting heterogeneous project artefacts into well-behaved text chunks. The pipeline is language-aware, pluggable, and ready for Nancy Brain’s MCP-backed retrieval workflows.
Note
See design/CHUNKY_V2_SPEC.md and design/CHUNK_MERGE_SPEC.md for
implemented v2/v2.1 behavior.
design/SEMANTIC_CHUNKER.md is retained as an archival early design draft.
Getting Started¶
Install the package from PyPI or from source:
pip install chunky-files
git clone https://github.com/AmberLee2427/chunky.git
cd chunky
pip install .
For development work and documentation builds:
pip install -e ".[dev,docs]"
First chunks via the pipeline:
from pathlib import Path
from chunky import ChunkPipeline, ChunkerConfig
pipeline = ChunkPipeline()
config = ChunkerConfig(
max_chars=1000,
min_chunk_chars=80, # forward-merge tiny chunks into successor chunks
lines_per_chunk=40,
line_overlap=5,
)
chunks = pipeline.chunk_file(Path("/path/to/file.py"), config=config)
for chunk in chunks:
print(chunk.chunk_id, chunk.metadata["line_start"], chunk.metadata["line_end"])
Built-in chunkers¶
PythonSemanticChunker— splits modules on top-level functions/classes and captures remaining context.MarkdownHeadingChunker— groups content per heading while keeping introductory prose.JSONYamlChunker— slices structured configs by their first-level keys/items and falls back if parsing fails.PlainTextChunker— groups blank-line separated paragraphs before falling back to sliding windows.FortranChunker— captures program, subroutine, and function blocks with minimal heuristics.RSTChunker— detects reStructuredText heading sections and chunks by section boundaries.NotebookChunker— groups nb4llm notebook exports (.nb.txt) into markdown+code context chunks.Tree-sitter chunkers (optional extra) for C/C++/HTML/Bash when the tree extra is installed, with gap-filling so uncaptured lines are still emitted.
SlidingWindowChunker— deterministic line windows with configurable overlap.
Chunk identifiers default to <doc_id>#chunk-0000. Provide Document.metadata['doc_id'] (or set
ChunkerConfig.doc_id_key) and adjust the suffix with ChunkerConfig.chunk_id_template to suit your
downstream needs.
Forward-merge behavior¶
Set ChunkerConfig.min_chunk_chars to a positive integer to merge tiny chunks into
their successor chunk (or into the predecessor for trailing tiny chunks). This keeps
small but meaningful context (imports, decorators, short comments/docstrings) attached
to nearby semantic content.
Roadmap¶
Phase 1: infrastructure scaffolding and sliding-window baseline.
Phase 2: language-specific chunkers (Python, Markdown, JSON/YAML, notebooks, RST).
Phase 3: semantic/embedding-driven chunking.
Phase 4: documentation, benchmarks, and Nancy Brain integration.