RAG
03·RAG·updated 2026-04-13

Chunking

TL;DR

Split a text into blocks to embed them individually in a RAG. Size and overlap are critical parameters that impact precision, cost and latency.

The historical problem

In 2023, context windows were small (GPT-3.5: 4k tokens; GPT-4: 8k-32k). A vector store retrieves by similarity between the query and pre-indexed documents, so documents had to be split into units small enough to:

  • Fit inside the LLM context along with the query
  • Have precise embeddings (one chunk = one "dense" idea)
  • Allow granular search

How it works

Classic pipeline:

  1. Load the document (PDF via PyPDF, web via LangChain loaders, etc.)
  2. Clean the text (NLP preprocessing: stopword removal, regex cleanup, lowercasing)
  3. Split into chunks of size N with overlap M
  4. Compute the embedding of each chunk via an embedding model
  5. Index in a [[vector-stores|vector store]]
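The five steps above can be sketched end-to-end. Everything here is a toy stand-in (a hashing-trick "embedding" instead of a real embedding model, a plain list instead of a vector store); the point is only to make the data flow concrete:

```python
import hashlib
import math
import re

def clean(text: str) -> str:
    # Step 2: basic cleanup (collapse whitespace, lowercase);
    # real pipelines add stopword removal, boilerplate stripping, etc.
    return re.sub(r"\s+", " ", text).strip().lower()

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Step 3: fixed-size split on words with overlap (word count as a
    # cheap proxy for tokens)
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(chunk: str, dim: int = 64) -> list[float]:
    # Step 4: hashing trick as a stand-in for a real embedding model,
    # normalized so dot product = cosine similarity
    vec = [0.0] * dim
    for word in chunk.split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def index(chunks: list[str]) -> list[tuple[list[float], str]]:
    # Step 5: the "vector store" is just (embedding, chunk) pairs
    return [(embed(c), c) for c in chunks]
```

In a real pipeline, `embed` becomes a call to an embedding model and `index` an upsert into the vector store; the shape of the flow stays the same.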

Splitting strategies:

  • Fixed-size: N tokens per chunk (simple but brutal)
  • Recursive: split by hierarchical separators (paragraphs, sentences, words)
  • Semantic: split on topic changes (via embeddings)
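A minimal recursive splitter, in the spirit of LangChain's RecursiveCharacterTextSplitter (the separator list and size limit here are illustrative): try the coarsest separator first, recurse into oversized pieces with finer separators, and hard-cut only as a last resort:

```python
def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    # Small enough already: keep as one chunk
    if len(text) <= max_chars:
        return [text]
    # No separator left: hard cut (last resort, may break semantics)
    if not separators:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_chars:
            # Greedily pack pieces while they fit
            current = candidate
        else:
            if current:
                chunks.append(current)
                current = ""
            if len(piece) > max_chars:
                # Piece itself too big: recurse with finer separators
                chunks.extend(recursive_split(piece, max_chars, rest))
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraph boundaries are preserved whenever they fit, which is exactly what the fixed-size approach cannot guarantee.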

Relevance today (2026)

Before (2023): chunking was critical because context was limited to 8-32k. Without chunking, RAG on large docs was impossible.

Today (2026):

  • Gemini 2.0: 2M token context
  • Claude Sonnet/Opus: 1M-token context
  • GPT-4 Turbo: 128k
  • Per-token prices dropped by a factor of ~10 in 2 years

The killer question: does chunking still make sense?

Nuanced answer:

  • No for "just fit the doc into the context": you can now stuff several full docs into one prompt
  • Yes for:
      1. Cost: sending 2M tokens on every query is ruinous
      2. Vector search precision: an oversized chunk gives a "blurry" embedding that matches poorly
      3. Latency: more tokens sent = slower answer
      4. Benchmarks: contextual retrieval and hybrid search still beat "stuff everything" on most eval sets
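The cost argument above is easy to check with back-of-envelope arithmetic. The price used here is purely illustrative, not any provider's actual rate:

```python
# Illustrative price (assumption): $3 per 1M input tokens
PRICE_PER_MTOK = 3.00

def input_cost(tokens: int) -> float:
    """Dollar cost of sending `tokens` input tokens at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

# "Stuff everything": a 2M-token corpus resent on every query
full_context = input_cost(2_000_000)   # $6.00 per query
# RAG: top-5 chunks of ~512 tokens + prompt, roughly 3k tokens
rag_context = input_cost(3_000)        # ~$0.009 per query
```

At 10,000 queries a day that is $60,000/day versus $90/day at the same assumed rate: a ~667x gap that no per-token price drop closes.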

Trends 2024-2026 to watch:

  • Late chunking (2024): embed the whole doc with full attention, THEN chunk the embeddings. Produces better embeddings because each chunk "knows" its global context.
  • Contextual retrieval (Anthropic, 2024-09): add a contextual summary to each chunk before embedding, reduces retrieval failure by 49%.
  • Prompt caching (Anthropic, OpenAI): if you use long context, the cache makes reinjection less costly.
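Contextual retrieval from the list above can be sketched as follows. The prompt is loosely adapted from Anthropic's published example, and the `llm` callable is a stub you would wire to your actual provider:

```python
# Prompt template loosely adapted from Anthropic's contextual
# retrieval write-up; wording here is an approximation.
CONTEXT_PROMPT = """\
<document>{document}</document>
Here is the chunk we want to situate within the whole document:
<chunk>{chunk}</chunk>
Give a short context situating this chunk within the overall document."""

def situate_chunk(document: str, chunk: str, llm=None) -> str:
    """Return the chunk prefixed with an LLM-generated situating context.

    `llm` is a callable(prompt) -> str; stubbed here because the actual
    call depends on your provider SDK.
    """
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = llm(prompt) if llm else "(context unavailable)"
    # The contextualized text, not the bare chunk, is what gets embedded
    return f"{context}\n\n{chunk}"
```

Note the cost profile this implies: one LLM call per chunk at indexing time (where prompt caching on the full document helps), zero extra calls at query time.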

Critical conclusion: naive chunking (fixed-size 512 tokens) is obsolete for a pro setup. Serious teams in 2026 use recursive or semantic + contextual retrieval + rerank.

Critical questions

  • Why 512 tokens and not 100 or 2000? What is the trade-off?
  • If the info sits across 2 chunks, what happens? How do you mitigate?
  • Overlap 10% vs 30%: impact on precision and storage cost?
  • Same chunking for code and for prose: good or bad idea?
  • In production, which metric do you watch to know if your chunking is good?
  • If my user asks short questions and my docs are long, how do I adapt?
  • Why is semantic chunking not always better in practice?
  • If I move to a 1M context, do I still chunk? Why?

Production pitfalls

  • Chunks too small (< 100 tokens): insufficient context, partial answers, hallucinations
  • Chunks too large (> 2000 tokens): diluted embedding, vector search precision drops, the LLM drowns in irrelevant context
  • No overlap: a sentence cut across 2 chunks = lost info (neither chunk matches well)
  • Uniform for all types: a prose chunking applied to code breaks the semantics
  • Chunking with no metadata: impossible to trace where the info came from for the user
  • Re-chunking without reindexing: incoherent base
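To avoid the "no metadata" pitfall above, carry provenance with every chunk. A minimal sketch (field names are hypothetical, not any library's schema):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str   # file or URL the chunk came from
    start: int    # character offset in the original document
    end: int

def chunk_with_metadata(doc: str, source: str,
                        size: int = 500, overlap: int = 50) -> list[Chunk]:
    # Character-based fixed-size split that records provenance, so
    # answers can cite their source and a re-chunk can be diffed
    # against the existing index instead of silently diverging from it.
    step = size - overlap
    return [Chunk(doc[i:i + size], source, i, min(i + size, len(doc)))
            for i in range(0, max(len(doc) - overlap, 1), step)]
```

Storing `(source, start, end)` alongside each embedding is what makes "show me where this answer came from" possible at query time.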

Alternatives / Comparisons

| Approach | When? | Limits |
| --- | --- | --- |
| Fixed-size chunking | Fast prototyping | Often breaks semantics |
| Recursive chunking | Pro default | Separator configuration |
| Semantic chunking | Heterogeneous docs | More expensive (embeddings to decide) |
| Late chunking | When you want the best | Needs recent models (Jina v3+) |
| Stuff-all-in-context | Short docs + 1M context | Cost explodes at scale |
| Contextual retrieval | Pro production RAG | Adds one LLM call per chunk at indexing |

Mini-lab

[[labs/02-rag-with-chunking/]] - experiment with 3 chunk sizes (256, 512, 1024) + 3 overlaps (0%, 10%, 30%) and measure precision on a set of questions.

To create: /lab chunking.

Further reading

  • Paper "Late Chunking" (Jina AI, 2024)
  • Contextual Retrieval (Anthropic, 2024): https://www.anthropic.com/news/contextual-retrieval
  • LangChain RecursiveCharacterTextSplitter vs SemanticChunker
  • RAGAS / LlamaIndex eval benchmarks on different strategies
  • Hierarchical chunking (parent-child) for traceability
rag · chunking · preprocessing