RAG
03·RAG·updated 2026-04-19

RAG Architectures (the 8 main patterns)

Watch or read first

TL;DR

RAG is not one architecture but a family. Daily Dose DS lists 8 common patterns: Naive, Multimodal, HyDE, Corrective, Graph, Hybrid, Adaptive, Agentic. Each fixes a specific failure of Naive RAG. Pick by analyzing your data shape, query shape, and latency budget.

The historical problem

Naive RAG works for simple fact lookup but breaks on:

  • Questions worded very differently from answers (semantic gap)
  • Multi-hop queries that need 2+ retrieval steps
  • Relationships between entities (graphs, not documents)
  • Mixed modalities (text + images + tables)
  • Low-quality retrieved chunks that mislead the LLM
  • Queries that do not need retrieval at all

Each architecture below addresses one or more of these failures.

How it works: the 8 patterns

1. Naive RAG

query --> embed --> search vector DB --> top-k --> stuff in prompt --> LLM --> answer

Simple vector similarity between query and stored chunks. Works for direct factual Q&A on a homogeneous corpus.

When: MVP, simple knowledge bases, "what is X" queries. Breaks when: queries are complex, wording mismatches, multi-hop.
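The pipeline above can be sketched in a few lines. This is a toy, not a real implementation: the bag-of-words `embed`, the in-memory index, and the pass-through `llm` are all stand-ins for a real embedding model, vector DB, and LLM call.

```python
# Minimal Naive RAG sketch: embed query, rank stored chunks by cosine
# similarity, stuff the top-k into the prompt. All components are toys.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CORPUS = [
    "Attention lets a model weigh tokens by relevance.",
    "BM25 is a sparse lexical ranking function.",
    "Qdrant is a vector database.",
]
INDEX = [(doc, embed(doc)) for doc in CORPUS]

def naive_rag(query: str, llm, k: int = 2) -> str:
    q = embed(query)
    top_k = sorted(INDEX, key=lambda d: cosine(q, d[1]), reverse=True)[:k]
    context = "\n".join(doc for doc, _ in top_k)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

# Pass-through "LLM" so we can inspect the final prompt.
answer = naive_rag("What is a vector database?", llm=lambda p: p)
```

Swapping `embed` for a real model and `INDEX` for a vector DB gives you the whole architecture; that is why it breaks exactly where embeddings break.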

2. Multimodal RAG

Embed and retrieve across text, image, audio, video using models like CLIP, Nomic-Embed-Multimodal, Voyage Multimodal.

text query --> multimodal embed --> search mixed index --> chunks + images --> multimodal LLM --> answer

When: product search with photos, medical imaging + notes, video understanding. Breaks when: your base model cannot reason across modalities well.

3. HyDE (Hypothetical Document Embeddings)

The insight: a question is not semantically similar to its answer. "How does attention work?" does not look like a paragraph describing attention.

HyDE fix: ask the LLM to hallucinate a hypothetical answer first, embed THAT, and search. The hallucinated answer is closer in embedding space to real answers than the raw question.

query --> LLM generates fake answer --> embed fake answer --> search --> real chunks --> LLM --> final answer

See hyde for the deep dive.

When: retrieval recall is low and queries are short questions. Breaks when: the hallucinated answer drifts off-topic, or when latency matters (HyDE adds an extra LLM call per query).
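The two-step trick is easy to see in code. A sketch with toy components: the bag-of-words `embed` and the canned `fake_answer_llm` are stand-ins, not a real pipeline.

```python
# HyDE sketch: embed a hallucinated answer instead of the raw question.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CORPUS = [
    "attention computes weighted sums of value vectors using query key scores",
    "the cafeteria menu changes every monday",
]

def hyde_retrieve(query, fake_answer_llm, k=1):
    # Step 1: hallucinate a plausible answer. Its facts may be wrong;
    # only its wording matters, because it lives near real answers
    # in embedding space.
    hypothetical = fake_answer_llm(query)
    # Step 2: embed the hypothetical answer, not the question.
    q = embed(hypothetical)
    return sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Canned "LLM" for the demo.
docs = hyde_retrieve(
    "how does attention work?",
    fake_answer_llm=lambda q: "attention computes weighted sums of values "
                              "from query key similarity scores",
)
```

Note the raw question shares almost no vocabulary with the target chunk, while the fake answer shares most of it; that overlap is the entire mechanism.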

4. Corrective RAG

Validate retrieved chunks against trusted sources before sending to the LLM.

query --> retrieve --> check relevance --> if bad, re-retrieve or web search --> LLM

Common pattern: a small classifier scores retrieved chunks; a score below threshold triggers a fallback (web search, different index, human).

When: high-stakes domains (legal, medical, finance). Freshness matters. Breaks when: the classifier is unreliable or web search is noisy.
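The score-then-fallback pattern fits in one function. `score_relevance` and `web_search_fallback` are hypothetical stand-ins for a trained relevance classifier and a real fallback source.

```python
# Corrective RAG sketch: keep only chunks the classifier trusts;
# if none survive, escalate to the fallback path.
def corrective_retrieve(query, chunks, score_relevance,
                        web_search_fallback, threshold=0.5):
    scored = [(c, score_relevance(query, c)) for c in chunks]
    good = [c for c, s in scored if s >= threshold]
    if not good:
        # No chunk cleared the bar: re-retrieve from the fallback source.
        return web_search_fallback(query)
    return good

kept = corrective_retrieve(
    "latest EU AI Act fines",
    chunks=["stale 2021 draft text"],
    score_relevance=lambda q, c: 0.2,            # classifier: irrelevant
    web_search_fallback=lambda q: ["fresh web result"],
)
```

The whole architecture lives in that `if not good` branch, which is also why it only works as well as the classifier behind it.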

5. Graph RAG

Convert retrieved content into a knowledge graph (entities + relationships). The LLM sees both the raw text AND the graph structure.

documents --> entity + relation extraction --> knowledge graph --> query-guided traversal --> LLM

Microsoft's GraphRAG (2024) is the reference implementation.

When: relational queries ("who worked with X and founded Y?"), complex reasoning over entities. Breaks when: data is not entity-centric or graph construction is too expensive.
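Why a graph beats chunks for the "worked with X and founded Y" query: it is a two-hop traversal, not a similarity match. A sketch over an invented toy graph (the entities and relations are demo data, and real systems extract them with an LLM):

```python
# Graph RAG sketch: answer a relational query by traversing an
# entity-relationship graph instead of matching text chunks.
GRAPH = {
    ("alice", "worked_with"): ["bob"],
    ("bob", "founded"): ["acme"],
    ("carol", "founded"): ["umbra"],
}

def neighbors(entity, relation):
    return GRAPH.get((entity, relation), [])

def multi_hop(start, relations):
    # Follow a chain of relations, e.g. worked_with -> founded.
    frontier = [start]
    for rel in relations:
        frontier = [n for e in frontier for n in neighbors(e, rel)]
    return frontier

# "Which companies were founded by people who worked with Alice?"
companies = multi_hop("alice", ["worked_with", "founded"])
```

No chunk in a vector index needs to mention Alice and acme together; the graph composes the answer from two separate facts.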

6. Hybrid RAG

Combines dense vector retrieval with sparse retrieval (BM25) OR with graph retrieval in one pipeline.

query --> dense search (vectors) + sparse search (BM25) --> RRF or weighted fusion --> top-k --> LLM

In 2026 this is the DEFAULT for production RAG, not an exotic option. Most vector DBs have it built in.

When: mixed query types (exact terms + concepts), or when names and codes matter. Breaks when: the dense/sparse fusion weights are mistuned; tuning them well is non-trivial.
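The RRF fusion step in the diagram is simple enough to show in full; the two rankings are hard-coded here to stand in for real dense and BM25 results.

```python
# Reciprocal Rank Fusion (RRF): merge multiple rankings by summing
# 1 / (k + rank) per document, rank starting at 1. k=60 is the
# conventional default from the original RRF paper.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]   # vector search order, best first
sparse = ["d1", "d4", "d2"]   # BM25 order, best first
fused = rrf([dense, sparse])
```

`d1` wins because it ranks well in both lists; documents found by only one retriever still survive, just lower. RRF needs no score calibration between retrievers, which is why most vector DBs ship it as the default fusion.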

7. Adaptive RAG

Dynamically decides whether a query needs retrieval at all, and how many steps.

query --> classifier -->
  simple fact   --> direct LLM
  single-hop    --> Naive RAG
  multi-hop     --> iterative retrieval with sub-queries

Often implemented as an [[../05-ai-agents/react-pattern|agentic]] system that routes.

When: mixed traffic (some queries need RAG, some do not), cost optimization. Breaks when: the classifier misroutes and the user gets a weak answer.
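The router from the diagram can be sketched as a classifier plus a dispatch table. The keyword classifier here is a toy stand-in for a trained router (real systems use a small fine-tuned model or an LLM call).

```python
# Adaptive RAG routing sketch: classify the query, then dispatch to
# the cheapest pipeline that can answer it.
def classify(query: str) -> str:
    q = query.lower()
    if " and " in q or "compare" in q:
        return "multi-hop"
    if any(w in q for w in ("who", "what", "when", "where")):
        return "single-hop"
    return "no-retrieval"

def route(query: str) -> str:
    return {
        "no-retrieval": "direct LLM",
        "single-hop": "naive RAG",
        "multi-hop": "iterative retrieval",
    }[classify(query)]
```

The cost win comes entirely from the "direct LLM" branch: every query routed there skips retrieval, and every misroute there produces the weak answer mentioned above.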

8. Agentic RAG

The LLM itself is an agent that decides retrieval strategy on the fly: which source, how many times, when to stop.

user query --> agent loop:
  plan --> retrieve --> evaluate --> re-plan if needed --> answer

See agentic rag for the deep dive.

When: complex workflows, multi-tool, multi-source (vector DB + SQL + web + APIs). Breaks when: agents loop or cost explodes.
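The agent loop above, with the hard iteration cap that prevents the runaway-cost failure mode. `retrieve`, `is_sufficient`, and `refine_query` are stand-ins for real tools, an LLM judge, and an LLM re-planner (all assumptions for the demo).

```python
# Agentic RAG loop sketch: plan -> retrieve -> evaluate -> re-plan,
# bounded by max_steps so cost cannot explode.
def agentic_rag(query, retrieve, is_sufficient, refine_query, max_steps=3):
    context, q = [], query
    for step in range(max_steps):
        context += retrieve(q)
        if is_sufficient(query, context):
            return context, step + 1
        q = refine_query(query, context)   # re-plan using what we learned
    return context, max_steps              # budget exhausted: answer anyway

ctx, steps = agentic_rag(
    "who founded the company that makes Claude?",
    retrieve=lambda q: [f"result for: {q}"],
    is_sufficient=lambda q, ctx: len(ctx) >= 2,  # toy judge: enough after 2 chunks
    refine_query=lambda q, ctx: q + " (refined)",
)
```

Everything interesting lives in `is_sufficient` and `refine_query`; the loop itself is trivial, which is why frameworks like LangGraph focus on making those steps observable.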

Bonus architectures (not in Daily Dose DS's list of 8)

REFRAG (Meta, 2025): compress chunks into single vectors, RL policy selects which to expand. See refrag and cag.

CAG (Cache-Augmented Generation): put stable context in the KV cache, skip retrieval for it. Hybrid RAG+CAG. See refrag and cag.

Contextual Retrieval (Anthropic, 2024): prepend a context summary to each chunk before embedding. Not really a separate architecture, more a chunking upgrade, but gives huge gains.
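The chunking upgrade is mechanically simple, which is why the only barrier was LLM cost. A sketch where `summarize_llm` stands in for the cheap per-chunk LLM call, and the document title and chunk text are invented demo data:

```python
# Contextual Retrieval sketch: prepend a document-level context line to
# each chunk before embedding, so chunks carry their provenance into
# embedding space instead of floating context-free.
def contextualize(doc_title, chunks, summarize_llm):
    out = []
    for chunk in chunks:
        prefix = summarize_llm(doc_title, chunk)  # e.g. "[From ACME Q2 report]"
        out.append(f"{prefix} {chunk}")
    return out

indexed = contextualize(
    "ACME Q2 2025 earnings report",
    ["Revenue grew 3% over the previous quarter."],
    summarize_llm=lambda title, chunk: f"[From {title}]",
)
```

Without the prefix, "Revenue grew 3%" matches any revenue question from any company; with it, the chunk embeds near queries about ACME specifically.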

Relevance today (2026)

The "8 architectures" frame is didactic, not exclusive

Real production systems mix these. A typical 2026 prod RAG is:

  • Hybrid (dense + BM25) + Contextual Retrieval + Reranker + Agentic routing for complex queries + CAG for static policy docs.

That is five techniques stacked in one system, and only two of them (Hybrid, Agentic) come from the canonical list of 8.

The winners of 2024-2026

  • Hybrid search: everybody uses it.
  • Contextual Retrieval: big wins, low cost.
  • Reranking with cross-encoders: standard.
  • Agentic RAG: rising fast as reasoning models get cheap.
  • Graph RAG: niche, heavy, but strong for the right use case.

The losers

  • Naive RAG in prod: a 2023 artifact. If your prod is naive, you have homework.
  • HyDE alone: rarely worth the extra LLM call. Often beaten by a better embedding model + reranker.
  • Corrective RAG: partly absorbed into Agentic RAG.

The emerging

  • REFRAG: ~30x faster time-to-first-token by compressing chunks into single vectors; still early.
  • ColBERT / ColPali: late-interaction retrieval (multi-vector per doc). Strong on long docs and visual RAG.
  • Cache-Augmented Generation (CAG): practical for stable corpora with frequent queries.

Decision matrix (2026)

| Your situation | Start with |
| --- | --- |
| Prototype, small corpus | Naive + reranker |
| Production Q&A on docs | Hybrid + Contextual + reranker |
| Multi-source, tool use | Agentic RAG |
| Entities, relationships | Graph RAG |
| Images + text | Multimodal RAG |
| Static policies, high QPS | RAG + CAG |
| Latency-critical, long context | REFRAG |
| Queries differ from answers | HyDE as a small step, or upgrade embeddings |

Critical questions

  • Does your problem actually need RAG? (If the answer is in the LLM's training, skip RAG entirely.)
  • Why not always use Agentic RAG? (Cost and latency. Agent loops can 10x your per-query cost.)
  • When do you pick Graph RAG over Hybrid? (When your domain is explicitly relational: biomedical pathways, org charts, legal citation networks.)
  • Why is HyDE less popular in 2026? (Better embeddings and contextual retrieval close the question-answer gap without an extra LLM call.)
  • Contextual Retrieval is 2024. Why did nobody do it earlier? (Cost: needed a cheap enough LLM to run it on every chunk at indexing. Claude Haiku and GPT-4o-mini made it trivial.)

Production pitfalls

  • Over-engineering. You built Graph RAG + Multimodal + Agentic RAG on day one. You cannot debug any of it. Start simple, add complexity with eval-driven pressure.
  • Wrong retrieval layer for the query shape. Vector-only on SKU codes returns garbage. BM25-only on paraphrases returns nothing. Test your query distribution.
  • No eval suite. You cannot compare architectures without a golden set. Build 50-200 Q&A pairs early.
  • Mixing architectures inconsistently. Agentic RAG where the agent sometimes uses graph, sometimes dense, without a clear rule. Debug nightmare.
  • Ignoring the reranker. Across architectures, a good reranker gives 10-30% accuracy gains. Cheapest win.

Alternatives / Comparisons

RAG is one way to inject knowledge. Alternatives and hybrids:

| Approach | Knowledge source | Cost pattern | When |
| --- | --- | --- | --- |
| Prompt engineering | None beyond LLM | Cheap | Small tasks, LLM knows it |
| RAG (any architecture) | External vector DB | Query-time cost | Private or fresh data |
| Fine-tuning (LoRA) | Weights | Training-time | Style, vocab change, not new facts |
| Full fine-tuning | Weights | Big training cost | Rare, replaced by LoRA |
| Tool use via function calling | External APIs per call | Per-call cost | Live data, calculations |
| CAG (KV cache) | Cached prefix | Flat cost after cache | Stable corpus, high QPS |

See rag vs finetuning for the full decision matrix.

Mental parallels (non-AI)

  • Library systems over time:
    • Naive RAG = card catalog by keyword.
    • HyDE = asking a friend to describe what you want, then searching with that description.
    • Graph RAG = Wikipedia hyperlink navigation.
    • Agentic RAG = a research librarian who asks clarifying questions, uses multiple catalogs, and validates sources.
    • Corrective RAG = a peer reviewer.
  • Search evolution: Google started as PageRank (sparse). Added semantic search (vectors). Added knowledge graph. Added snippet generation (like RAG generation step). The whole evolution of search mirrors RAG architectures.

Mini-lab

labs/rag-architectures/ (to create):

  1. Build the same Q&A task with 3 architectures on the same corpus:
    • Naive RAG (baseline)
    • Hybrid search + Contextual Retrieval + Rerank
    • Agentic RAG (LangGraph)
  2. Evaluate on 50 Q&A pairs with RAGAS.
  3. Record cost per query, latency, faithfulness, answer relevance.
  4. Summarize: which pattern wins for your data and at what cost?

Stack: uv, langchain, langgraph, qdrant, cohere, anthropic.

Further reading

Canonical

Related in this KB

Tools

rag · architecture · patterns · hyde · graph-rag · agentic · refrag · cag · hybrid