Chunking
TL;DR
Split a text into blocks so each can be embedded individually in a RAG pipeline. Chunk size and overlap are critical parameters that affect precision, cost, and latency.
The historical problem
In 2023, context windows were small (GPT-3.5: 4k, GPT-4: 8k-32k tokens). Vector stores search by similarity between a query and pre-indexed documents. So we had to split documents into units small enough to:
- Fit inside the LLM context along with the query
- Have precise embeddings (one chunk = one "dense" idea)
- Allow granular search
How it works
Classic pipeline:
- Load the document (PDF via PyPDF, web via LangChain loaders, etc.)
- Clean the text (NLP preprocessing: stopword removal, regexes, lowercasing)
- Split into chunks of size N with overlap M
- Compute the embedding of each chunk via an embedding model
- Index in a [[vector-stores|vector store]]
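Step 3 of the pipeline can be sketched in a few lines. This is a minimal illustration, not a production splitter: words stand in for tokens here, whereas a real pipeline would count tokens with the tokenizer of the chosen embedding model (e.g. tiktoken).

```python
# Fixed-size chunking with overlap. Words approximate tokens for simplicity.
def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks
```

With `size=4, overlap=2`, an 8-word text yields three chunks whose boundaries share two words, which is exactly what keeps a sentence from being lost at a cut point.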
Splitting strategies:
- Fixed-size: N tokens per chunk (simple but brutal)
- Recursive: split by hierarchical separators (paragraphs, sentences, words)
- Semantic: split on topic changes (via embeddings)
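The recursive strategy is worth seeing concretely: try the coarsest separator first and only descend to finer ones when a piece is still too long. The sketch below measures size in characters for simplicity and drops the separators it splits on; LangChain's `RecursiveCharacterTextSplitter` follows the same idea with more care.

```python
# Recursive splitting over a hierarchy of separators.
def split_recursive(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    out = []
    for part in text.split(head):  # split on the coarsest separator first
        if len(part) <= max_len:
            if part:
                out.append(part)
        else:
            # piece still too long: descend to the next, finer separator
            out.extend(split_recursive(part, max_len, tuple(rest)))
    return out
```

Paragraph boundaries are respected when possible, and only oversized paragraphs get broken at sentence or word level.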
Relevance today (2026)
Before (2023): chunking was critical because context was limited to 8-32k. Without chunking, RAG on large docs was impossible.
Today (2026):
- Gemini 2.0: 2M token context
- Claude Sonnet/Opus: 1M-token context
- GPT-4 Turbo: 128k
- Per-token prices dropped by a factor of ~10 in 2 years
The killer question: does chunking still make sense?
Nuanced answer:
- No for "just fit the doc into the context": you can stuff several full docs
- Yes, for:
  - Cost: sending 2M tokens on every query is ruinous
  - Vector-search precision: an oversized chunk yields a "blurry" embedding that matches poorly
  - Latency: more tokens sent means a slower answer
  - Benchmarks: contextual retrieval and hybrid search still beat "stuff everything" on most eval sets
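The cost point is easy to quantify with a back-of-envelope calculation. The $3 per million input tokens used below is an assumed illustrative price, not a quote for any specific model.

```python
# Cost of stuffing a huge context on every query vs. retrieving a few chunks.
PRICE_PER_M_INPUT = 3.00  # USD per 1M input tokens -- assumed for illustration

def query_cost(tokens_sent: int) -> float:
    return tokens_sent / 1_000_000 * PRICE_PER_M_INPUT

full_context = query_cost(2_000_000)  # stuff a 2M-token corpus every time
rag = query_cost(5 * 512)             # top-5 retrieved chunks of 512 tokens
```

Under these assumptions, full-context stuffing costs several dollars per query while retrieval costs a fraction of a cent, a gap of roughly three orders of magnitude that multiplies with every query served.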
Trends 2024-2026 to watch:
- Late chunking (2024): embed the whole doc with full attention, THEN chunk the embeddings. Produces better embeddings because each chunk "knows" its global context.
- Contextual retrieval (Anthropic, 2024-09): add a contextual summary to each chunk before embedding, reduces retrieval failure by 49%.
- Prompt caching (Anthropic, OpenAI): if you use long context, the cache makes reinjection less costly.
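Contextual retrieval is simple to sketch at indexing time: each chunk gets a short LLM-generated preamble situating it in the full document, and the combined text is what gets embedded. `generate` and `embed` below are placeholders for your own LLM and embedding-model calls; the prompt wording is an assumption, not Anthropic's exact prompt.

```python
# Contextual retrieval at indexing time (sketch).
def contextualize(doc: str, chunk: str, generate) -> str:
    prompt = (
        "Here is a document:\n" + doc +
        "\n\nHere is a chunk from it:\n" + chunk +
        "\n\nWrite one sentence situating this chunk within the document."
    )
    # Embed the generated context *together with* the chunk text.
    return generate(prompt) + "\n\n" + chunk

def index_chunks(doc: str, chunks: list[str], generate, embed) -> list:
    return [embed(contextualize(doc, c, generate)) for c in chunks]
```

This is also where the cost noted in the comparison table comes from: one extra LLM call per chunk at indexing time (mitigated in practice by prompt caching of the shared document prefix).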
Critical conclusion: naive chunking (fixed-size 512 tokens) is obsolete for a professional setup. Serious teams in 2026 use recursive or semantic chunking combined with contextual retrieval and reranking.
Critical questions
- Why 512 tokens and not 100 or 2000? What is the trade-off?
- If the info sits across 2 chunks, what happens? How do you mitigate it?
- Overlap 10% vs 30%: what is the impact on precision and storage cost?
- Same chunking for code and for prose: good or bad idea?
- In production, which metric do you watch to know if your chunking is good?
- If my users ask short questions and my docs are long, how do I adapt?
- Why is semantic chunking not always better in practice?
- If I move to a 1M context, do I still chunk? Why?
Production pitfalls
- Chunks too small (< 100 tokens): insufficient context, partial answers, hallucinations
- Chunks too large (> 2000 tokens): diluted embedding, vector-search precision drops, the LLM drowns in irrelevant text
- No overlap: a sentence cut across 2 chunks is lost info (neither chunk matches well)
- Uniform chunking for all content types: prose-oriented chunking applied to code breaks its semantics
- Chunking with no metadata: impossible to show the user where the info came from
- Re-chunking without reindexing: an incoherent index
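The metadata pitfall has a cheap fix: carry the source and position with every chunk so answers can cite where they came from. The field names below are illustrative, not any particular vector store's schema.

```python
# Attach traceability metadata to each chunk before indexing.
def with_metadata(chunks: list[str], source: str) -> list[dict]:
    return [
        {"text": c, "source": source, "chunk_index": i, "total": len(chunks)}
        for i, c in enumerate(chunks)
    ]
```

At answer time, the `source` and `chunk_index` fields travel with the retrieved chunk, which is what makes "where does this come from?" answerable for the user.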
Alternatives / Comparisons
| Approach | When? | Limits |
|---|---|---|
| Fixed-size chunking | Fast prototyping | Often breaks semantics |
| Recursive chunking | Pro default | Separator configuration |
| Semantic chunking | Heterogeneous docs | More expensive (embeddings to decide) |
| Late chunking | When you want the best | Needs recent models (Jina v3+) |
| Stuff-all-in-context | Short docs + 1M context | Cost explodes at scale |
| Contextual retrieval | Pro production RAG | Adds one LLM call per chunk at indexing |
Mini-lab
[[labs/02-rag-with-chunking/]] - experiment with 3 chunk sizes (256, 512, 1024) + 3 overlaps (0%, 10%, 30%) and measure precision on a set of questions.
To create: /lab chunking.
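The lab's grid can be sketched as follows: run retrieval with each (size, overlap) pair and record the hit rate, i.e. whether the chunk containing the known answer comes back in the top-k. `retrieve` and the question set are placeholders you supply; the metric here is a deliberately simple substring check.

```python
from itertools import product

# Evaluate retrieval hit rate over a grid of chunk sizes and overlaps.
def grid_eval(retrieve, questions, sizes=(256, 512, 1024), overlaps=(0.0, 0.1, 0.3)):
    results = {}
    for size, ov in product(sizes, overlaps):
        hits = sum(1 for q, expected in questions
                   if expected in retrieve(q, size, ov))
        results[(size, ov)] = hits / len(questions)
    return results
```

With 3 sizes and 3 overlaps this produces the 9-cell grid the lab describes, one hit rate per configuration.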
Further reading
- Paper "Late Chunking" (Jina AI, 2024)
- Contextual Retrieval (Anthropic, 2024): https://www.anthropic.com/news/contextual-retrieval
- LangChain: RecursiveCharacterTextSplitter vs SemanticChunker
- RAGAS / LlamaIndex eval benchmarks on different strategies
- Hierarchical chunking (parent-child) for traceability