RAG
03·RAG·updated 2026-04-13

Chunking

TL;DR

Chunking splits a text into blocks that are embedded individually in a RAG pipeline. Chunk size and overlap are the critical parameters: they drive retrieval precision, cost, and latency.

The historical problem

In 2023, context windows were small (GPT-3.5: 4k, GPT-4: 8k-32k tokens). Vector stores search by similarity between a query and pre-indexed documents. So we had to split documents into units small enough to:

  • Fit inside the LLM context along with the query
  • Have precise embeddings (one chunk = one "dense" idea)
  • Allow granular search

How it works

Classic pipeline:

  1. Load the document (PDF via PyPDF, web via LangChain loaders, etc.)
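  2. Clean the text (strip boilerplate, normalize whitespace and encoding; classic NLP preprocessing such as stopword removal or lowercasing is rarely needed with modern embedding models and can degrade them)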
  3. Split into chunks of size N with overlap M
  4. Compute the embedding of each chunk via an embedding model
  5. Index in a [[vector-stores|vector store]]
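
A minimal end-to-end sketch of steps 2 to 5, using only the standard library. The `toy_embed` function is a hypothetical stand-in for a real embedding model (a hashed bag of words), and the "vector store" is a plain Python list, so this only illustrates the data flow, not retrieval quality.

```python
import hashlib
import math
import re


def clean(text: str) -> str:
    """Step 2: collapse whitespace (a stand-in for real cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def split(text: str, size: int, overlap: int) -> list[str]:
    """Step 3: windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, size - overlap)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Step 4 (hypothetical model): hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(document: str, size: int = 50, overlap: int = 10):
    """Step 5: an in-memory list of (chunk, vector) pairs."""
    return [(c, toy_embed(c)) for c in split(clean(document), size, overlap)]


def search(index, query: str, k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity (vectors are already normalized)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

In a real pipeline, `toy_embed` would be replaced by a call to an embedding model and the list by a vector store; the shape of the loop stays the same.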

Splitting strategies:

  • Fixed-size: N tokens per chunk (simple but brutal)
  • Recursive: split by hierarchical separators (paragraphs, sentences, words)
  • Semantic: split on topic changes (via embeddings)

5 chunking strategies (Daily Dose DS)

Source: Daily Dose DS AI Engineering Guidebook (2025), section "5 chunking strategies for RAG".

1. Fixed-size chunking

Split text into uniform segments based on a pre-defined number of characters, words, or tokens. Maintain some overlap (10-20%) to avoid breaking ideas across chunks.

  • Pros: trivial to implement, equal chunk size simplifies batch processing, fast.
  • Cons: often cuts sentences or ideas in the middle. Important information gets distributed between chunks.
  • When: prototyping, benchmarking baselines, chat logs where structure is weak.
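
As a rough illustration, fixed-size chunking with a percentage overlap fits in a few lines. The sketch below works on any pre-tokenized list (words, characters, or model tokens); the function name and the `overlap_ratio` parameter are illustrative choices, not taken from the guidebook.

```python
def fixed_size_chunks(tokens: list, size: int, overlap_ratio: float = 0.1) -> list[list]:
    """Cut `tokens` into windows of `size` items sharing size * overlap_ratio items."""
    overlap = int(size * overlap_ratio)
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

For n tokens, this produces about ceil((n - overlap) / (size - overlap)) chunks, so a larger overlap means more chunks to embed and store.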

2. Semantic chunking

Segment based on meaningful units (sentences, paragraphs, topics). Create embeddings for each segment. Merge consecutive segments while their cosine similarity stays high. When similarity drops below a threshold, start a new chunk.

  • Pros: preserves complete ideas, produces richer embeddings, improves retrieval accuracy.
  • Cons: requires an embedding model at indexing time (extra cost), threshold is sensitive and varies per document type.
  • When: heterogeneous docs where topics shift (blog posts, books, mixed reports).
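
The merge loop described above can be sketched as follows. To stay self-contained, the example uses a bag-of-words `Counter` as a stand-in embedding (an assumption for illustration); in practice `embed` would call a real embedding model, and the threshold would need tuning per corpus.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': word counts. A real model would return a dense vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.15, embed=bow, similarity=cosine) -> list[str]:
    """Merge consecutive sentences while similarity to the current chunk stays high."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        if current and similarity(embed(" ".join(current)), embed(sentence)) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close the chunk
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Here each new sentence is compared with the whole chunk built so far; comparing it with only the previous sentence is an equally plausible variant.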

3. Recursive chunking

First chunk by inherent separators (paragraphs, sections). Then, for any chunk that still exceeds the size limit, recursively split it into smaller chunks using the next level of separators (sentences, then words).

  • Pros: respects natural structure while enforcing a size budget, pragmatic default.
  • Cons: extra implementation complexity, separator hierarchy must be tuned per language/format.
  • When: production default. LangChain's RecursiveCharacterTextSplitter implements this strategy.
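
To make the recursion concrete, here is a simplified pure-Python sketch. It is not LangChain's implementation: it measures length in characters, has no overlap, and drops the separator at chunk boundaries. The default separator list is an assumption.

```python
def recursive_split(text: str, max_len: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator, recursing only into pieces that are too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_len:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_len:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```

Pieces are greedily packed back together up to `max_len`, so a paragraph that fits is never cut at sentence or word level.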

4. Document structure-based chunking

Uses the inherent structure of documents (Markdown headings, HTML sections, PDF bookmarks, code blocks) to define chunk boundaries. Each logical section becomes one chunk.

  • Pros: preserves document hierarchy, chunks map to navigable units (ideal for citations).
  • Cons: assumes the document actually has clean structure (many do not). Chunks may exceed token limits on long sections.
  • When: well-formatted docs (API references, markdown knowledge bases, legal contracts). Combine with recursive fallback.
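
For Markdown sources, a sketch of the idea could look like this. It assumes ATX-style headings (`#` to `######`) and ignores edge cases such as `#` lines inside fenced code blocks; the `path` field is an illustrative way to keep the heading hierarchy for citations.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_sections(markdown: str) -> list[dict]:
    """Return one chunk per section, tagged with its heading path."""
    sections, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that was open before this heading
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper headings
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections
```

A section whose text is still too long would then be handed to a recursive splitter, which is the fallback mentioned above.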

5. LLM-based chunking

Prompt an LLM to produce semantically isolated and meaningful chunks. The LLM reads the doc and returns the chunk boundaries based on meaning.

  • Pros: highest semantic accuracy. The model understands context beyond heuristics.
  • Cons: the most computationally demanding approach. LLM context limits cap how much you can process per call. Expensive at scale.
  • When: small high-value corpora (legal, medical, research), where retrieval quality outweighs indexing cost.
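
One possible shape for this, sketched under assumptions: sentences are numbered, the model is asked for the indices where a new chunk should start, and the reply is validated before use. The `complete` callable is hypothetical (any function that takes a prompt and returns the model's text), and the prompt wording is only an example.

```python
import json
import re
from typing import Callable

PROMPT = (
    "Below are numbered sentences from one document. Return a JSON list of the "
    "sentence numbers where a new, self-contained topic starts.\n\n"
)


def llm_chunks(text: str, complete: Callable[[str], str]) -> list[str]:
    """Ask an LLM for chunk boundaries; fall back to a single chunk on bad output."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    try:
        starts = json.loads(complete(PROMPT + numbered))
    except json.JSONDecodeError:
        starts = []
    if not isinstance(starts, list):
        starts = []
    # Keep only in-range integer indices; index 0 always starts the first chunk.
    valid = sorted({i for i in starts if type(i) is int and 0 < i < len(sentences)})
    bounds = [0] + valid + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Validating the reply matters because the model may return malformed JSON or out-of-range indices; documents longer than the model's context would have to be processed window by window.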

Quick picker

| Situation | Strategy |
| --- | --- |
| Ship today, reasonable accuracy | Recursive |
| Blog posts, heterogeneous topics | Semantic |
| Well-structured docs (Markdown, API docs) | Structure-based (+ recursive fallback) |
| High-stakes small corpus | LLM-based |
| Log lines, chat transcripts | Fixed-size |

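In 2026, the serious setup is structure-based or recursive chunking, followed by Contextual Retrieval (Anthropic, 2024): an LLM writes a short context that situates each chunk within its document, and that context is prepended to the chunk before embedding. This approaches LLM-based chunking quality while keeping chunk boundaries cheap to compute; the added cost is one LLM call per chunk at indexing time, which prompt caching helps reduce.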
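
A minimal sketch of the contextualization step, assuming chunks already exist. The `situate` callable is hypothetical: it stands for an LLM call that receives the full document and one chunk and returns a short context for that chunk.

```python
from typing import Callable


def contextualize(document: str, chunks: list[str],
                  situate: Callable[[str, str], str]) -> list[dict]:
    """Prepend an LLM-written context to each chunk before embedding."""
    records = []
    for chunk in chunks:
        context = situate(document, chunk).strip()
        records.append({
            "text": chunk,  # what is shown to the user and the answering LLM
            "text_to_embed": f"{context}\n\n{chunk}" if context else chunk,
        })
    return records
```

Keeping the original `text` separate from `text_to_embed` lets the index benefit from the added context without changing what is quoted back to the user.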

Relevance today (2026)

Before (2023): chunking was critical because context was limited to 4-32k tokens. Without chunking, RAG on large docs was impossible.

Today (2026):

  • Gemini 2.0: 2M token context
  • Claude Sonnet/Opus: 1M tokens
  • GPT-4 Turbo: 128k
  • Per-token prices dropped by a factor of ~10 in 2 years

The killer question: does chunking still make sense?

Nuanced answer:

  • No, if the goal is just to fit the doc into the context: you can now stuff several full docs into one prompt.
  • Yes, for four reasons: (1) cost: sending 2M tokens on every query is ruinously expensive; (2) vector search precision: an oversized chunk yields a "blurry" embedding that matches poorly; (3) latency: more input tokens mean slower answers; (4) benchmarks: contextual retrieval and hybrid search still beat "stuff everything" on most eval sets.
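
The cost argument is simple arithmetic. The sketch below compares input-token cost per query for "stuff everything" versus retrieving a few chunks; the price of 3 USD per million input tokens and the 200-token question are assumed values for illustration, not quotes from any provider.

```python
def cost_per_query(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one query, ignoring output tokens and caching."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.0  # assumed USD per million input tokens (illustrative only)

stuff_everything = cost_per_query(2_000_000, PRICE)   # full 2M-token context
top_5_chunks = cost_per_query(5 * 512 + 200, PRICE)   # 5 chunks of 512 + question

ratio = stuff_everything / top_5_chunks  # how many times more expensive
```

Under these assumptions, the full-context query costs 6 USD against less than one cent for the chunked one, a gap of several hundred times before caching is taken into account.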

Trends 2024-2026 to watch:

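  • Late chunking (2024): run the whole doc through a long-context embedding model to get token-level embeddings, THEN pool those token embeddings per chunk. Produces better embeddings because each chunk "knows" its global context.
  • Contextual retrieval (Anthropic, 2024-09): add an LLM-generated context to each chunk before embedding. Anthropic reports 49% fewer failed retrievals when contextual embeddings are combined with contextual BM25, and 67% fewer with reranking on top.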
  • Prompt caching (Anthropic, OpenAI): if you use long context, the cache makes reinjection less costly.
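
The pooling half of late chunking can be sketched without any model. The example assumes the token-level embeddings were already produced upstream by one pass of a long-context embedding model over the whole document (not shown), and that chunk boundaries are given as token index ranges; mean pooling is one common choice.

```python
def late_chunk(token_embeddings: list[list[float]],
               spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool document-level token embeddings over each (start, end) span."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled
```

The difference from classic chunking lies entirely upstream: because every token vector was computed with attention over the full document, each pooled chunk vector carries that global context.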

Critical conclusion: naive chunking (fixed-size 512 tokens) is obsolete for a production setup. Serious teams in 2026 use structure-based, recursive, or semantic chunking, plus contextual retrieval and reranking.

Critical questions

  • Why 512 tokens and not 100 or 2000? What is the trade-off?
  • If the info is split across 2 chunks, what happens? How do you mitigate it?
  • Overlap of 10% vs 30%: what is the impact on precision and storage cost?
  • Using the same chunking for code and for prose: good or bad idea?
  • In production, which metric do you watch to know whether your chunking is good?
  • If my users ask short questions and my docs are long, how do I adapt?
  • Why is semantic chunking not always better in practice?
  • If I move to a 1M context, do I still chunk? Why?

Production pitfalls

  • Chunks too small (< 100 tokens): insufficient context, partial answers, hallucinations
  • Chunks too large (> 2000 tokens): diluted embedding, vector search precision drops, and the LLM is flooded with irrelevant text
  • No overlap: a sentence cut across 2 chunks means lost info (neither chunk matches well)
  • One strategy for all content types: chunking rules tuned for prose break the semantics of code
  • Chunking with no metadata: impossible to show the user where the info came from
  • Re-chunking without reindexing: the index no longer matches the chunks, so the base becomes inconsistent
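
The last two pitfalls can be reduced by storing provenance with every chunk. The sketch below is one possible record layout (field names are illustrative): character offsets make each chunk traceable to its source, and a `chunker` tag records the settings used, so chunks built with old settings can be detected and reindexed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str   # document identifier, e.g. a path or URL
    start: int    # character offset in the source (inclusive)
    end: int      # character offset in the source (exclusive)
    text: str
    chunker: str  # settings used, to spot stale chunks after re-chunking


def chunk_with_metadata(source: str, text: str, size: int, overlap: int) -> list[Chunk]:
    """Fixed-size character chunks that remember where they came from."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunker = f"fixed-chars:{size}:{overlap}"
    chunks = []
    for start in range(0, len(text), size - overlap):
        end = min(start + size, len(text))
        chunks.append(Chunk(source, start, end, text[start:end], chunker))
        if end == len(text):
            break
    return chunks
```

At query time, `source`, `start`, and `end` are enough to build a citation or to open the document at the right place.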

Alternatives / Comparisons

| Approach | When? | Limits |
| --- | --- | --- |
| Fixed-size chunking | Fast prototyping | Often breaks semantics |
| Recursive chunking | Pro default | Separator configuration |
| Semantic chunking | Heterogeneous docs | More expensive (embeddings to decide) |
| Late chunking | When you want the best | Needs recent models (Jina v3+) |
| Stuff-all-in-context | Short docs + 1M context | Cost explodes at scale |
| Contextual retrieval | Pro production RAG | Adds one LLM call per chunk at indexing |

Mini-lab

[[labs/02-rag-with-chunking/]] - experiment with 3 chunk sizes (256, 512, 1024) + 3 overlaps (0%, 10%, 30%) and measure precision on a set of questions.
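
A possible skeleton for this experiment, as a sketch only: it scores chunks by word overlap with the question (a stand-in for real vector search) and counts a hit when one of the top-k chunks contains the expected answer string. Sizes are counted in words here rather than tokens, and all function names are illustrative.

```python
import itertools
import re


def split_words(text: str, size: int, overlap_ratio: float) -> list[str]:
    words = text.split()
    step = size - int(size * overlap_ratio)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def score(question: str, chunk: str) -> float:
    """Toy relevance: share of question words found in the chunk."""
    q = set(re.findall(r"\w+", question.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / (len(q) or 1)


def hit_rate(chunks: list[str], questions: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of (question, answer) pairs whose answer appears in the top-k chunks."""
    hits = 0
    for question, answer in questions:
        top = sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]
        hits += any(answer.lower() in c.lower() for c in top)
    return hits / len(questions)


def run_grid(text, questions, sizes=(256, 512, 1024), overlaps=(0.0, 0.1, 0.3), k=3):
    """Measure hit rate for every (size, overlap) combination."""
    return {
        (size, ratio): hit_rate(split_words(text, size, ratio), questions, k)
        for size, ratio in itertools.product(sizes, overlaps)
    }
```

Swapping `score` for a real embedding search and `hit_rate` for the metric of your choice keeps the grid loop unchanged.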

To create: /lab chunking.

Further reading

Tags: rag, chunking, preprocessing