RAG
03·RAG·updated 2026-04-13

Chunking

TL;DR

Split a text into blocks to embed them individually in a RAG. Size and overlap are critical parameters that impact precision, cost and latency.

The historical problem

In 2023, context windows were small (GPT-3.5: 4k tokens; GPT-4: 8k-32k). A vector store retrieves by similarity between the query and pre-indexed documents, so documents had to be split into units small enough to:

  • Fit inside the LLM context along with the query
  • Have precise embeddings (one chunk = one "dense" idea)
  • Allow granular search

How it works

Classic pipeline:

  1. Load the document (PDF via PyPDF, web via LangChain loaders, etc.)
  2. Clean the text (NLP preprocessing: stopword removal, regex cleanup, lowercasing)
  3. Split into chunks of size N with overlap M
  4. Compute the embedding of each chunk via an embedding model
  5. Index in a [[vector-stores|vector store]]
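The five steps above can be sketched end-to-end. Everything here is a toy stand-in (a hashing-trick "embedding" instead of a real embedding model, a plain list instead of a vector store); the point is only to make the data flow concrete:

```python
import hashlib
import math
import re

def clean(text: str) -> str:
    # Step 2: basic cleanup (collapse whitespace, lowercase);
    # real pipelines add stopword removal, boilerplate stripping, etc.
    return re.sub(r"\s+", " ", text).strip().lower()

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Step 3: fixed-size split on words with overlap (word count as a
    # cheap proxy for tokens)
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(chunk: str, dim: int = 64) -> list[float]:
    # Step 4: hashing trick as a stand-in for a real embedding model,
    # normalized so dot product = cosine similarity
    vec = [0.0] * dim
    for word in chunk.split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def index(chunks: list[str]) -> list[tuple[list[float], str]]:
    # Step 5: the "vector store" is just (embedding, chunk) pairs
    return [(embed(c), c) for c in chunks]
```

In a real pipeline, `embed` becomes a call to an embedding model and `index` an upsert into the vector store; the shape of the flow stays the same.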

Splitting strategies:

  • Fixed-size: N tokens per chunk (simple but brutal)
  • Recursive: split by hierarchical separators (paragraphs, sentences, words)
  • Semantic: split on topic changes (via embeddings)
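A minimal recursive splitter, in the spirit of LangChain's RecursiveCharacterTextSplitter (the separator list and size limit here are illustrative): try the coarsest separator first, recurse into oversized pieces with finer separators, and hard-cut only as a last resort:

```python
def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    # Small enough already: keep as one chunk
    if len(text) <= max_chars:
        return [text]
    # No separator left: hard cut (last resort, may break semantics)
    if not separators:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_chars:
            # Greedily pack pieces while they fit
            current = candidate
        else:
            if current:
                chunks.append(current)
                current = ""
            if len(piece) > max_chars:
                # Piece itself too big: recurse with finer separators
                chunks.extend(recursive_split(piece, max_chars, rest))
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraph boundaries are preserved whenever they fit, which is exactly what the fixed-size approach cannot guarantee.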

Relevance today (2026)

Before (2023): chunking was critical because context was limited to 8-32k. Without chunking, RAG on large docs was impossible.

Today (2026):

  • Gemini 2.0: 2M token context
  • Claude Sonnet/Opus: 1M-token context
  • GPT-4 Turbo: 128k
  • Per-token prices dropped by a factor of ~10 in 2 years

The killer question: does chunking still make sense?

Nuanced answer:

  • No for "just fit the doc into the context": you can now stuff several full docs into one prompt
  • Yes for:
      1. Cost: sending 2M tokens on every query is ruinous
      2. Vector search precision: an oversized chunk gives a "blurry" embedding that matches poorly
      3. Latency: more tokens sent = slower answer
      4. Benchmarks: contextual retrieval and hybrid search still beat "stuff everything" on most eval sets
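The cost argument above is easy to check with back-of-envelope arithmetic. The price used here is purely illustrative, not any provider's actual rate:

```python
# Illustrative price (assumption): $3 per 1M input tokens
PRICE_PER_MTOK = 3.00

def input_cost(tokens: int) -> float:
    """Dollar cost of sending `tokens` input tokens at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

# "Stuff everything": a 2M-token corpus resent on every query
full_context = input_cost(2_000_000)   # $6.00 per query
# RAG: top-5 chunks of ~512 tokens + prompt, roughly 3k tokens
rag_context = input_cost(3_000)        # ~$0.009 per query
```

At 10,000 queries a day that is $60,000/day versus $90/day at the same assumed rate: a ~667x gap that no per-token price drop closes.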

Trends 2024-2026 to watch:

  • Late chunking (2024): embed the whole doc with full attention, THEN chunk the embeddings. Produces better embeddings because each chunk "knows" its global context.
  • Contextual retrieval (Anthropic, 2024-09): add a contextual summary to each chunk before embedding, reduces retrieval failure by 49%.
  • Prompt caching (Anthropic, OpenAI): if you use long context, the cache makes reinjection less costly.
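Contextual retrieval from the list above can be sketched as follows. The prompt is loosely adapted from Anthropic's published example, and the `llm` callable is a stub you would wire to your actual provider:

```python
# Prompt template loosely adapted from Anthropic's contextual
# retrieval write-up; wording here is an approximation.
CONTEXT_PROMPT = """\
<document>{document}</document>
Here is the chunk we want to situate within the whole document:
<chunk>{chunk}</chunk>
Give a short context situating this chunk within the overall document."""

def situate_chunk(document: str, chunk: str, llm=None) -> str:
    """Return the chunk prefixed with an LLM-generated situating context.

    `llm` is a callable(prompt) -> str; stubbed here because the actual
    call depends on your provider SDK.
    """
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = llm(prompt) if llm else "(context unavailable)"
    # The contextualized text, not the bare chunk, is what gets embedded
    return f"{context}\n\n{chunk}"
```

Note the cost profile this implies: one LLM call per chunk at indexing time (where prompt caching on the full document helps), zero extra calls at query time.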

Critical conclusion: naive chunking (fixed-size 512 tokens) is obsolete for a pro setup. Serious teams in 2026 use recursive or semantic + contextual retrieval + rerank.

Critical questions

  • Why 512 tokens and not 100 or 2000? What is the trade-off?
  • If the info sits across 2 chunks, what happens? How do you mitigate?
  • Overlap 10% vs 30%: impact on precision and storage cost?
  • Same chunking for code and for prose: good or bad idea?
  • In production, which metric do you watch to know if your chunking is good?
  • If my user asks short questions and my docs are long, how do I adapt?
  • Why is semantic chunking not always better in practice?
  • If I move to a 1M context, do I still chunk? Why?

Production pitfalls

  • Chunks too small (< 100 tokens): insufficient context, partial answers, hallucinations
  • Chunks too large (> 2000 tokens): diluted embedding, vector search precision drops, the LLM drowns in irrelevant context
  • No overlap: a sentence cut across 2 chunks = lost info (neither chunk matches well)
  • Uniform for all types: a prose chunking applied to code breaks the semantics
  • Chunking with no metadata: impossible to trace where the info came from for the user
  • Re-chunking without reindexing: incoherent base
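To avoid the "no metadata" pitfall above, carry provenance with every chunk. A minimal sketch (field names are hypothetical, not any library's schema):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str   # file or URL the chunk came from
    start: int    # character offset in the original document
    end: int

def chunk_with_metadata(doc: str, source: str,
                        size: int = 500, overlap: int = 50) -> list[Chunk]:
    # Character-based fixed-size split that records provenance, so
    # answers can cite their source and a re-chunk can be diffed
    # against the existing index instead of silently diverging from it.
    step = size - overlap
    return [Chunk(doc[i:i + size], source, i, min(i + size, len(doc)))
            for i in range(0, max(len(doc) - overlap, 1), step)]
```

Storing `(source, start, end)` alongside each embedding is what makes "show me where this answer came from" possible at query time.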

Alternatives / Comparisons

| Approach | When? | Limits |
| --- | --- | --- |
| Fixed-size chunking | Fast prototyping | Often breaks semantics |
| Recursive chunking | Pro default | Separator configuration |
| Semantic chunking | Heterogeneous docs | More expensive (embeddings to decide) |
| Late chunking | When you want the best | Needs recent models (Jina v3+) |
| Stuff-all-in-context | Short docs + 1M context | Cost explodes at scale |
| Contextual retrieval | Pro production RAG | Adds one LLM call per chunk at indexing |

Mini-lab

[[labs/02-rag-with-chunking/]] - experiment with 3 chunk sizes (256, 512, 1024) + 3 overlaps (0%, 10%, 30%) and measure precision on a set of questions.

To create: /lab chunking.

Further reading

  • Paper "Late Chunking" (Jina AI, 2024)
  • Contextual Retrieval (Anthropic, 2024): https://www.anthropic.com/news/contextual-retrieval
  • LangChain RecursiveCharacterTextSplitter vs SemanticChunker
  • RAGAS / LlamaIndex eval benchmarks on different strategies
  • Hierarchical chunking (parent-child) for traceability
rag · chunking · preprocessing