Date: March 2026
Status: Active
Context
After converting documents to Markdown with Docling, we need to split them into chunks for embedding and retrieval. The chunk size directly impacts retrieval quality.
Options considered
| Chunk size | Overlap | Pros | Cons |
|---|---|---|---|
| 200 tokens | 20 | Very precise retrieval | Too small — loses context, more chunks to embed |
| 500 tokens | 50 | Good balance of precision and context | May split some ideas across chunks |
| 1000 tokens | 100 | Rich context per chunk | Less precise — chunk may contain irrelevant parts |
| Full document | 0 | Complete context | Too large for embedding, single vector per doc |
Decision: 500 tokens / 50 overlap
Why 500 tokens
- ~375 words — roughly one topic or paragraph
- Large enough to contain a complete idea
- Small enough that the embedding captures a specific concept (not a blurry mix of many topics)
- Fits well within embedding model input limits
- Standard recommendation in most RAG guides and frameworks
Why a 50-token overlap
- 10% of chunk size — prevents cutting a sentence in half at chunk boundaries
- If a key sentence falls right at the edge of a chunk, it appears in both the current and next chunk
- Small enough to avoid excessive duplication (only ~37 extra words per chunk)
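The duplication cost above can be sanity-checked with quick arithmetic, using the rough rule of thumb that one English token is about 0.75 words (an assumption, not a measurement):

```python
# Chunking parameters from the decision above.
CHUNK_SIZE = 500  # tokens
OVERLAP = 50      # tokens
WORDS_PER_TOKEN = 0.75  # rough English average; an assumption

overlap_ratio = OVERLAP / CHUNK_SIZE          # 0.1, i.e. 10% of each chunk
extra_words = int(OVERLAP * WORDS_PER_TOKEN)  # ~37 duplicated words

print(f"overlap: {overlap_ratio:.0%}, duplicated words per chunk: ~{extra_words}")
```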
Splitter choice: RecursiveCharacterTextSplitter
Measures length with tiktoken's cl100k_base tokenizer (the encoding that splits text into tokens), so chunk size is counted in tokens rather than characters. The splitter tries break points in this order:
- `\n\n` (paragraph break — best split point)
- `\n` (line break)
- ` ` (space — last resort)
This keeps paragraphs intact when possible.
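The recursive strategy can be sketched in plain Python. This is a simplified stand-in, not the actual RecursiveCharacterTextSplitter: whitespace word counts replace tiktoken tokens so the sketch stays dependency-free, and overlap is omitted for brevity.

```python
def token_len(text: str) -> int:
    # Stand-in for tiktoken's cl100k_base: counts whitespace-separated
    # words instead of real tokens, so the sketch has no dependencies.
    return len(text.split())

def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Split `text` into pieces of at most `chunk_size` "tokens",
    trying coarser separators first so paragraphs stay intact."""
    if token_len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if token_len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if token_len(part) <= chunk_size or not rest:
            current = part
        else:
            # This piece alone is too big: recurse with the next separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Because `\n\n` is tried first, a document made of short paragraphs splits cleanly on paragraph boundaries; only an oversized paragraph falls through to line breaks and, finally, spaces.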
Impact
With our 2 PDF books:
- Total text: ~550k characters
- Chunks created: 804
- Average chunk: ~375 words
Future experiment
Try different chunk sizes (200, 500, 1000) and compare retrieval quality with the same set of test questions. Save results in docs/benchmarks/.
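A minimal harness for that experiment might look like the sketch below. The function name and `score_fn` are hypothetical placeholders — the real pipeline would re-chunk the corpus, re-embed, run retrieval, and return a quality score per question.

```python
import json
from pathlib import Path

def run_chunk_size_sweep(chunk_sizes, questions, score_fn, out_dir="docs/benchmarks"):
    """Score retrieval for each chunk size over the same question set
    and persist the averages as JSON. `score_fn(chunk_size, question)`
    is a placeholder for the actual retrieval-quality metric."""
    results = {
        size: sum(score_fn(size, q) for q in questions) / len(questions)
        for size in chunk_sizes
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "chunk_size_sweep.json").write_text(json.dumps(results, indent=2))
    return results
```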