RAG
03·RAG·updated 2026-04-19

RAG Workflow (end-to-end pipeline)

RAG (Retrieval-Augmented Generation) wires an LLM to a private knowledge source. The standard pipeline is 8 steps: chunk, embed, store, query, embed query, retrieve top-k, rerank, generate. Each step is a failure point. Mastering RAG means mastering all eight.


The historical problem

LLMs are frozen at training time. They do not know:

  • Anything after their cutoff date
  • Your company's private documents
  • Your user's personal data
  • Live data (stock prices, weather, inventory)

Retraining the model on new data every day is impractical: LLMs take weeks and millions of dollars to train.

You could stuff new info in the prompt, but context windows were tiny in 2022-2023 (4k, 8k, 32k tokens). Even in 2026 with 1M-token context, stuffing everything costs too much to scale and dilutes attention.

RAG (introduced by Lewis et al., Facebook AI, 2020) splits the problem: store everything in a [[vector-databases|vector database]], retrieve only the relevant bits per query, and stuff those bits in the prompt.

How it works

The 8-step pipeline

  INDEXING (offline)               QUERYING (online per request)
  ----------------------------     --------------------------------
  1. Chunk documents               4. User query arrives
  2. Embed each chunk              5. Embed the query (same model)
  3. Store in vector DB            6. Retrieve top-k similar chunks
                                   7. (Optional) Rerank top-k
                                   8. Generate answer with LLM

Step 1: Chunk documents

Split large documents into smaller pieces that fit the embedding model's input size and produce tight, semantic embeddings.

See chunking for the 5 strategies (fixed-size, semantic, recursive, document-structure-based, LLM-based).
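A minimal sketch of the simplest of those strategies, fixed-size chunking with overlap (the 1000/200 sizes are illustrative, not recommendations):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap, so a sentence cut at a
    boundary still appears whole in the neighboring chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

doc = "word " * 600  # ~3000 characters of dummy text
chunks = chunk_text(doc, size=1000, overlap=200)
print(len(chunks), len(chunks[0]))
```

The overlap is what prevents a fact from being split across two chunks and lost to retrieval; the trade-off is storing (and embedding) redundant text.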

Step 2: Embed each chunk

Pass each chunk through an embedding model (OpenAI text-embedding-3-small, Cohere embed-multilingual-v3, Voyage voyage-3, open-source BGE-M3, etc.). Output: a vector of dimension 512-3072.

These are context embeddings (bi-encoder style), not word embeddings. The whole chunk is one vector.
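To keep the "whole chunk in, one fixed-size vector out" idea concrete without an API key, here is a deterministic stand-in for an embedding model (character-trigram hashing). A real pipeline would call text-embedding-3-small or similar instead; this toy captures only the shape of the operation, not the semantics:

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 256) -> list[float]:
    """Stand-in for a real embedding model: hash character trigrams
    into a fixed-size vector, then L2-normalize. One chunk -> one vector."""
    vec = [0.0] * dims
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

v = toy_embed("currency hedging in the 2023 annual report")
print(len(v))  # fixed dimension, regardless of chunk length
```

Note the function is deterministic: the same text always maps to the same vector, which is exactly the property that makes index-time and query-time embeddings comparable.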

Step 3: Store in a vector database

Dump the vectors plus their original text (payload) and metadata (source, date, author, tags) into a [[vector-databases|vector store]] (Pinecone, Qdrant, Weaviate, Chroma, pgvector, Milvus, Elastic).

Step 4: User query arrives

A user asks a question in natural language.

Step 5: Embed the query

Use the SAME embedding model as Step 2 to get a query vector. This is critical: different embedding models live in different spaces and their vectors cannot be compared.

Step 6: Retrieve top-k similar chunks

Approximate nearest neighbor search (HNSW, IVF, ScaNN) finds the k chunks whose vectors are closest to the query vector (cosine similarity, dot product, or L2).

Typical k: 5-20 for direct generation, 50-100 if you rerank.
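Brute-force cosine top-k in plain Python; a real system delegates this to an ANN index (HNSW etc.), but the ranking logic is the same. The three-dimensional toy vectors are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """index: list of (chunk_text, vector). Returns the k most similar chunks."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.1]),
    ("returns and refunds", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
print([text for _, text in results])  # -> ['refund policy', 'returns and refunds']
```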

Step 7: (Optional) Rerank

The initial top-k is based on dense similarity alone. A cross-encoder reranker (Cohere Rerank, BGE-reranker, Jina Reranker) looks at the query plus each chunk and assigns a more precise relevance score. Re-sort, keep the top-N (e.g., top-5 of 50).

Many starter RAGs skip this step. Production RAGs almost always include it.
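The shape of the rerank step, with a toy term-overlap scorer standing in for a real cross-encoder (Cohere Rerank, BGE-reranker, and similar models score each (query, chunk) pair jointly and are far more precise):

```python
def toy_cross_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of query terms present in
    the chunk. A real reranker attends over the pair jointly."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

def rerank(query: str, chunks: list[str], top_n: int = 2) -> list[str]:
    scored = sorted(chunks, key=lambda c: toy_cross_score(query, c), reverse=True)
    return scored[:top_n]

candidates = [  # e.g. the top-50 returned by dense retrieval
    "our refund window is 30 days",
    "to request a refund email support",
    "we ship worldwide",
]
top = rerank("how do I request a refund", candidates, top_n=2)
print(top)
```

The pattern to notice: retrieval scores query and chunk independently (two forward passes, cacheable), while reranking scores them together (one pass per pair, expensive but sharper) — hence retrieve 50, rerank to 5.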

Step 8: Generate the final response

Feed the original query plus the top chunks to the LLM in a prompt template:

You are a helpful assistant. Answer the user's question using ONLY
the context below. If the context does not contain the answer, say so.

Context:
[chunk 1 ...]
[chunk 2 ...]
[chunk 3 ...]

Question: {user_query}

Answer:

The LLM synthesizes an answer grounded in the retrieved context.
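Assembling that template is plain string work; the wording below is taken from the template above, and the chunk-numbering scheme is one common convention:

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Fill the RAG prompt template with retrieved chunks and the user query."""
    context = "\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "You are a helpful assistant. Answer the user's question using ONLY\n"
        "the context below. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\nAnswer:"
    )

prompt = build_prompt("What is the refund window?",
                      ["Refunds are accepted within 30 days.",
                       "Shipping takes 3-5 business days."])
print(prompt)
```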

Visual pipeline

  Indexing (run once, refresh on update):

  Documents --> Chunker --> Embedding Model --> Vector DB
                              (text-embed)      (Pinecone, etc.)
                                                 + metadata
                                                 + payload

  Querying (every user request):

  User Query
     |
     v
  Embedding Model (same as indexing)
     |
     v
  Vector DB --> top-k chunks by cosine similarity
     |
     v
  Reranker (cross-encoder)  [optional]
     |
     v
  top-N refined chunks
     |
     v
  Prompt Template + LLM --> Answer

Relevance today (2026)

1. RAG did NOT die with 1M-token contexts

In 2023, with 8k contexts, RAG was the only way to work with large documents. By 2026, million-token context windows (Claude at 1M, Gemini at 2M, GPT-4.1 at 1M) led some people to declare "just stuff the whole doc in the prompt". That is wrong for three reasons:

  • Cost: 1M tokens per query at $3/1M = $3 per query. Not scalable.
  • Latency: prefill on 1M tokens is slow. Users do not wait.
  • Precision: needle-in-haystack tests show attention degrades at extreme context lengths.

RAG is still the default for serious production.

2. Contextual Retrieval (Anthropic, 2024)

A single big improvement: before embedding a chunk, prepend a short context ("This chunk is from the 2023 annual report, Risk Factors section, discussing currency hedging"). This extra context is generated by a cheap LLM pass at indexing time. Anthropic reported 49% fewer retrieval failures.

Most 2026 production RAGs use Contextual Retrieval or a variant.
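What gets embedded under Contextual Retrieval is the generated context plus the raw chunk. A sketch with the indexing-time LLM call stubbed out (a real pipeline would prompt a cheap model with the full document and the chunk and ask for a one-sentence situating context):

```python
def generate_chunk_context(document_summary: str, chunk: str) -> str:
    """Stub for the cheap indexing-time LLM call. A real implementation
    prompts a small model and returns a generated situating sentence."""
    return f"This chunk is from {document_summary}."

def contextualize(document_summary: str, chunk: str) -> str:
    # The context is PREPENDED to the chunk before embedding; the stored
    # payload can keep the raw chunk for display/citation.
    return generate_chunk_context(document_summary, chunk) + " " + chunk

text = contextualize("the 2023 annual report, Risk Factors section",
                     "Currency exposure is hedged with forward contracts.")
print(text)
```

The key design point: the extra cost is paid once, at indexing time, while every future query benefits from the disambiguated embeddings.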

3. Hybrid search is now standard

Dense vectors miss exact matches (product codes, names, rare terms). Hybrid retrieval combines dense (vector) + sparse (BM25) with reciprocal rank fusion. Almost every serious vector DB supports it natively in 2026.
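Reciprocal rank fusion itself is a few lines: each ranked list contributes 1/(k + rank) per document, and scores sum across lists (k=60 is the conventional default):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. dense + BM25) into one.
    Each list votes 1/(k + rank) for every doc it contains."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # vector-similarity order
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 order (exact matches win here)
fused = rrf([dense, sparse])
print(fused)  # docs ranked by both lists float to the top
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the dense and sparse retrievers — which is why it became the default fusion method.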

4. REFRAG and CAG (2025)

  • REFRAG (Meta, 2025): compress chunks into single vectors, filter with an RL policy, expand only the selected ones. 30x faster time-to-first-token, 16x bigger context.
  • CAG (Cache-Augmented Generation): put stable knowledge in KV cache (persistent), use RAG for volatile data. Hybrid RAG+CAG.

See refrag and cag.

5. Agentic RAG replaces static RAG for complex queries

Traditional RAG retrieves once and generates once. Agentic RAG decides whether to retrieve, which source, how many times, and validates the answer. See agentic rag.

6. Embedding models got much better

  • 2023: OpenAI text-embedding-ada-002, 1536 dims
  • 2024: OpenAI text-embedding-3-large, 3072 dims, multilingual
  • 2024: Voyage-3, Cohere v3.5 (commercial), BGE-M3 (open-source) - strong alternatives
  • 2025-2026: domain-specific embeddings (legal, medical, code) outperform general ones on their domain by 10-30%

Picking the right embedding model is a first-order decision. Benchmark on MTEB and your own data.

Critical questions

  • Why use the same embedding model for indexing and querying? What breaks otherwise?
  • Why top-k = 5 sometimes, top-k = 50 other times? What drives the choice?
  • Why is a reranker usually a cross-encoder while retrieval uses bi-encoders? What is the trade-off?
  • If you re-chunk your docs, what must you update? (Re-index everything. Old embeddings are stale.)
  • With 1M-token context, when is it still worth chunking? (When cost per query matters, when precision matters, when your corpus exceeds the context anyway.)
  • Why do you add the original query text at the end of the LLM prompt, not just the chunks? (So the LLM knows what to answer.)
  • Why include metadata in the LLM context? (Citations, grounding, traceability for the user.)

Production pitfalls

  • Embedding drift. You upgraded the embedding model, forgot to re-embed historical data. Mixed-space search = garbage results.
  • Chunks without metadata. User asks "where is this from?". You cannot answer. Always store source URL, section, date.
  • No reranking. Initial top-5 contains irrelevant fluff. LLM either ignores it or, worse, quotes it.
  • Prompt injection via retrieved docs. A malicious doc in your corpus instructs the LLM to ignore guidelines. Sanitize retrieved content, use content delimiters, audit your corpus.
  • Cold start on a new knowledge base. Embedding 10M chunks takes hours and costs money. Batch, use cheaper embeddings first, upgrade later.
  • No eval set. You tweak chunk size, top-k, reranker, and have no measure. Build a golden question/answer set early. RAGAS, TruLens, or home-grown.
  • Bad base model. RAG cannot fix a weak LLM. If the LLM hallucinates, retrieval alone will not save you. Pick a strong model for the generation step.
  • Stuffing too many chunks. Adding more chunks hurts after ~10-20. Attention dilutes. Measure per-chunk marginal utility.

Alternatives / Comparisons

See rag architectures for the 8 main patterns. Also see:

  Approach                When to use
  ----------------------  --------------------------------------------
  Naive RAG               MVP, simple Q&A
  Hybrid search RAG       Names, codes, rare terms matter
  Contextual Retrieval    Production, better precision
  Agentic RAG             Multi-step queries, tool use
  REFRAG                  Long context, latency-sensitive
  CAG                     Stable corpus that fits in the KV cache
  Stuff-all-in-context    Short docs, low QPS, 1M-context model
  Fine-tuning             Domain vocabulary, style change, no new facts

Mental parallels (non-AI)

  • Library + librarian: the vector DB is the library. The embedding model is the librarian's mental map of "books close to each other". Your query is a reader's request. The LLM is a researcher who reads the books the librarian brings and writes the report.
  • Legal case research: a lawyer retrieves relevant prior cases (retrieval), reads them (context), and composes an argument (generation).
  • Search engine: Google does RAG at planet scale. Query -> index retrieval -> ranking -> snippet generation. Replace "snippet" with "full answer" and you have modern RAG.
  • Chef and pantry: the corpus is the pantry. The recipe (query) tells the chef (LLM) which ingredients to grab. Without a good pantry layout (vector DB), cooking is slow and prone to wrong ingredients.

Mini-lab

labs/rag-workflow/ (to create):

  1. Build a RAG over the Daily Dose DS AI Engineering Guidebook itself.
  2. Use:
    • pypdf to extract text
    • LangChain RecursiveCharacterTextSplitter (1000/200)
    • text-embedding-3-small
    • Qdrant local
    • Cohere Rerank 3
    • Claude Haiku 4.5 for generation
  3. Write a golden eval set of 20 Q&A pairs.
  4. Measure precision@5, answer accuracy, latency, cost.
  5. Then swap in Contextual Retrieval and compare.

Stack: uv, langchain, qdrant-client, anthropic.

Further reading

Canonical

Related in this KB

Tools

rag · pipeline · workflow · retrieval · generation · architecture