RAG
03·RAG·updated 2026-04-19

RAG Architectures (the 8 main patterns)

Watch or read first

TL;DR

RAG is not one architecture but a family. Daily Dose DS lists 8 common patterns: Naive, Multimodal, HyDE, Corrective, Graph, Hybrid, Adaptive, Agentic. Each fixes a specific failure of Naive RAG. Pick by analyzing your data shape, query shape, and latency budget.

The historical problem

Naive RAG works for simple fact lookup but breaks on:

  • Questions worded very differently from answers (semantic gap)
  • Multi-hop queries that need 2+ retrieval steps
  • Relationships between entities (graphs, not documents)
  • Mixed modalities (text + images + tables)
  • Low-quality retrieved chunks that mislead the LLM
  • Queries that do not need retrieval at all

Each architecture below addresses one or more of these failures.

How it works: the 8 patterns

1. Naive RAG

query --> embed --> search vector DB --> top-k --> stuff in prompt --> LLM --> answer

Simple vector similarity between query and stored chunks. Works for direct factual Q&A on a homogeneous corpus.

When: MVP, simple knowledge bases, "what is X" queries. Breaks when: queries are complex, wording mismatches, multi-hop.
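The pipeline above can be sketched in a few lines. This is a toy, not a real implementation: the bag-of-words `embed`, the in-memory index, and the pass-through `llm` are all stand-ins for a real embedding model, vector DB, and LLM call.

```python
# Minimal Naive RAG sketch: embed query, rank stored chunks by cosine
# similarity, stuff the top-k into the prompt. All components are toys.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CORPUS = [
    "Attention lets a model weigh tokens by relevance.",
    "BM25 is a sparse lexical ranking function.",
    "Qdrant is a vector database.",
]
INDEX = [(doc, embed(doc)) for doc in CORPUS]

def naive_rag(query: str, llm, k: int = 2) -> str:
    q = embed(query)
    top_k = sorted(INDEX, key=lambda d: cosine(q, d[1]), reverse=True)[:k]
    context = "\n".join(doc for doc, _ in top_k)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

# Pass-through "LLM" so we can inspect the final prompt.
answer = naive_rag("What is a vector database?", llm=lambda p: p)
```

Swapping `embed` for a real model and `INDEX` for a vector DB gives you the whole architecture; that is why it breaks exactly where embeddings break.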

2. Multimodal RAG

Embed and retrieve across text, image, audio, video using models like CLIP, Nomic-Embed-Multimodal, Voyage Multimodal.

text query --> multimodal embed --> search mixed index --> chunks + images --> multimodal LLM --> answer

When: product search with photos, medical imaging + notes, video understanding. Breaks when: your base model cannot reason across modalities well.

3. HyDE (Hypothetical Document Embeddings)

The insight: a question is not semantically similar to its answer. "How does attention work?" does not look like a paragraph describing attention.

HyDE fix: ask the LLM to hallucinate a hypothetical answer first, embed THAT, and search. The hallucinated answer is closer in embedding space to real answers than the raw question.

query --> LLM generates fake answer --> embed fake answer --> search --> real chunks --> LLM --> final answer

See hyde for the deep dive.

When: retrieval recall is low and queries are short questions. Breaks when: the hallucinated answer drifts off-topic, or when latency matters (HyDE adds an extra LLM call per query).
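The two-step trick is easy to see in code. A sketch with toy components: the bag-of-words `embed` and the canned `fake_answer_llm` are stand-ins, not a real pipeline.

```python
# HyDE sketch: embed a hallucinated answer instead of the raw question.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CORPUS = [
    "attention computes weighted sums of value vectors using query key scores",
    "the cafeteria menu changes every monday",
]

def hyde_retrieve(query, fake_answer_llm, k=1):
    # Step 1: hallucinate a plausible answer. Its facts may be wrong;
    # only its wording matters, because it lives near real answers
    # in embedding space.
    hypothetical = fake_answer_llm(query)
    # Step 2: embed the hypothetical answer, not the question.
    q = embed(hypothetical)
    return sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Canned "LLM" for the demo.
docs = hyde_retrieve(
    "how does attention work?",
    fake_answer_llm=lambda q: "attention computes weighted sums of values "
                              "from query key similarity scores",
)
```

Note the raw question shares almost no vocabulary with the target chunk, while the fake answer shares most of it; that overlap is the entire mechanism.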

4. Corrective RAG

Validate retrieved chunks against trusted sources before sending to the LLM.

query --> retrieve --> check relevance --> if bad, re-retrieve or web search --> LLM

Common pattern: a small classifier scores retrieved chunks; a score below threshold triggers a fallback (web search, different index, human).

When: high-stakes domains (legal, medical, finance). Freshness matters. Breaks when: the classifier is unreliable or web search is noisy.
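The score-then-fallback pattern fits in one function. `score_relevance` and `web_search_fallback` are hypothetical stand-ins for a trained relevance classifier and a real fallback source.

```python
# Corrective RAG sketch: keep only chunks the classifier trusts;
# if none survive, escalate to the fallback path.
def corrective_retrieve(query, chunks, score_relevance,
                        web_search_fallback, threshold=0.5):
    scored = [(c, score_relevance(query, c)) for c in chunks]
    good = [c for c, s in scored if s >= threshold]
    if not good:
        # No chunk cleared the bar: re-retrieve from the fallback source.
        return web_search_fallback(query)
    return good

kept = corrective_retrieve(
    "latest EU AI Act fines",
    chunks=["stale 2021 draft text"],
    score_relevance=lambda q, c: 0.2,            # classifier: irrelevant
    web_search_fallback=lambda q: ["fresh web result"],
)
```

The whole architecture lives in that `if not good` branch, which is also why it only works as well as the classifier behind it.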

5. Graph RAG

Convert retrieved content into a knowledge graph (entities + relationships). The LLM sees both the raw text AND the graph structure.

documents --> entity + relation extraction --> knowledge graph --> query-guided traversal --> LLM

Microsoft's GraphRAG (2024) is the reference implementation.

When: relational queries ("who worked with X and founded Y?"), complex reasoning over entities. Breaks when: data is not entity-centric or graph construction is too expensive.
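Why a graph beats chunks for the "worked with X and founded Y" query: it is a two-hop traversal, not a similarity match. A sketch over an invented toy graph (the entities and relations are demo data, and real systems extract them with an LLM):

```python
# Graph RAG sketch: answer a relational query by traversing an
# entity-relationship graph instead of matching text chunks.
GRAPH = {
    ("alice", "worked_with"): ["bob"],
    ("bob", "founded"): ["acme"],
    ("carol", "founded"): ["umbra"],
}

def neighbors(entity, relation):
    return GRAPH.get((entity, relation), [])

def multi_hop(start, relations):
    # Follow a chain of relations, e.g. worked_with -> founded.
    frontier = [start]
    for rel in relations:
        frontier = [n for e in frontier for n in neighbors(e, rel)]
    return frontier

# "Which companies were founded by people who worked with Alice?"
companies = multi_hop("alice", ["worked_with", "founded"])
```

No chunk in a vector index needs to mention Alice and acme together; the graph composes the answer from two separate facts.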

6. Hybrid RAG

Combines dense vector retrieval with sparse retrieval (BM25) OR with graph retrieval in one pipeline.

query --> dense search (vectors) + sparse search (BM25) --> RRF or weighted fusion --> top-k --> LLM

In 2026 this is the DEFAULT for production RAG, not an exotic option. Most vector DBs have it built in.

When: mixed query types (exact terms + concepts), or when names and codes matter. Breaks when: the dense/sparse fusion weights are mistuned; tuning them well is non-trivial.
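The RRF fusion step in the diagram is simple enough to show in full; the two rankings are hard-coded here to stand in for real dense and BM25 results.

```python
# Reciprocal Rank Fusion (RRF): merge multiple rankings by summing
# 1 / (k + rank) per document, rank starting at 1. k=60 is the
# conventional default from the original RRF paper.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]   # vector search order, best first
sparse = ["d1", "d4", "d2"]   # BM25 order, best first
fused = rrf([dense, sparse])
```

`d1` wins because it ranks well in both lists; documents found by only one retriever still survive, just lower. RRF needs no score calibration between retrievers, which is why most vector DBs ship it as the default fusion.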

7. Adaptive RAG

Dynamically decides whether a query needs retrieval at all, and how many steps.

query --> classifier -->
  simple fact   --> direct LLM
  single-hop    --> Naive RAG
  multi-hop     --> iterative retrieval with sub-queries

Often implemented as an [[../05-ai-agents/react-pattern|agentic]] system that routes.

When: mixed traffic (some queries need RAG, some do not), cost optimization. Breaks when: the classifier misroutes and the user gets a weak answer.
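The router from the diagram can be sketched as a classifier plus a dispatch table. The keyword classifier here is a toy stand-in for a trained router (real systems use a small fine-tuned model or an LLM call).

```python
# Adaptive RAG routing sketch: classify the query, then dispatch to
# the cheapest pipeline that can answer it.
def classify(query: str) -> str:
    q = query.lower()
    if " and " in q or "compare" in q:
        return "multi-hop"
    if any(w in q for w in ("who", "what", "when", "where")):
        return "single-hop"
    return "no-retrieval"

def route(query: str) -> str:
    return {
        "no-retrieval": "direct LLM",
        "single-hop": "naive RAG",
        "multi-hop": "iterative retrieval",
    }[classify(query)]
```

The cost win comes entirely from the "direct LLM" branch: every query routed there skips retrieval, and every misroute there produces the weak answer mentioned above.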

8. Agentic RAG

The LLM itself is an agent that decides retrieval strategy on the fly: which source, how many times, when to stop.

user query --> agent loop:
  plan --> retrieve --> evaluate --> re-plan if needed --> answer

See agentic rag for the deep dive.

When: complex workflows, multi-tool, multi-source (vector DB + SQL + web + APIs). Breaks when: agents loop or cost explodes.
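The agent loop above, with the hard iteration cap that prevents the runaway-cost failure mode. `retrieve`, `is_sufficient`, and `refine_query` are stand-ins for real tools, an LLM judge, and an LLM re-planner (all assumptions for the demo).

```python
# Agentic RAG loop sketch: plan -> retrieve -> evaluate -> re-plan,
# bounded by max_steps so cost cannot explode.
def agentic_rag(query, retrieve, is_sufficient, refine_query, max_steps=3):
    context, q = [], query
    for step in range(max_steps):
        context += retrieve(q)
        if is_sufficient(query, context):
            return context, step + 1
        q = refine_query(query, context)   # re-plan using what we learned
    return context, max_steps              # budget exhausted: answer anyway

ctx, steps = agentic_rag(
    "who founded the company that makes Claude?",
    retrieve=lambda q: [f"result for: {q}"],
    is_sufficient=lambda q, ctx: len(ctx) >= 2,  # toy judge: enough after 2 chunks
    refine_query=lambda q, ctx: q + " (refined)",
)
```

Everything interesting lives in `is_sufficient` and `refine_query`; the loop itself is trivial, which is why frameworks like LangGraph focus on making those steps observable.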

Bonus architectures (not in Daily Dose DS's list of 8)

REFRAG (Meta, 2025): compress chunks into single vectors, RL policy selects which to expand. See refrag and cag.

CAG (Cache-Augmented Generation): put stable context in the KV cache, skip retrieval for it. Hybrid RAG+CAG. See refrag and cag.

Contextual Retrieval (Anthropic, 2024): prepend a context summary to each chunk before embedding. Not really a separate architecture, more a chunking upgrade, but gives huge gains.
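The chunking upgrade is mechanically simple, which is why the only barrier was LLM cost. A sketch where `summarize_llm` stands in for the cheap per-chunk LLM call, and the document title and chunk text are invented demo data:

```python
# Contextual Retrieval sketch: prepend a document-level context line to
# each chunk before embedding, so chunks carry their provenance into
# embedding space instead of floating context-free.
def contextualize(doc_title, chunks, summarize_llm):
    out = []
    for chunk in chunks:
        prefix = summarize_llm(doc_title, chunk)  # e.g. "[From ACME Q2 report]"
        out.append(f"{prefix} {chunk}")
    return out

indexed = contextualize(
    "ACME Q2 2025 earnings report",
    ["Revenue grew 3% over the previous quarter."],
    summarize_llm=lambda title, chunk: f"[From {title}]",
)
```

Without the prefix, "Revenue grew 3%" matches any revenue question from any company; with it, the chunk embeds near queries about ACME specifically.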

Relevance today (2026)

The "8 architectures" frame is didactic, not exclusive

Real production systems mix these. A typical 2026 prod RAG is:

  • Hybrid (dense + BM25) + Contextual Retrieval + Reranker + Agentic routing for complex queries + CAG for static policy docs.

That is five techniques stacked in one system, and only two of them (Hybrid, Agentic) come from the canonical list of 8.

The winners of 2024-2026

  • Hybrid search: everybody uses it.
  • Contextual Retrieval: big wins, low cost.
  • Reranking with cross-encoders: standard.
  • Agentic RAG: rising fast as reasoning models get cheap.
  • Graph RAG: niche, heavy, but strong for the right use case.

The losers

  • Naive RAG in prod: a 2023 artifact. If your prod is naive, you have homework.
  • HyDE alone: rarely worth the extra LLM call. Often beaten by a better embedding model + reranker.
  • Corrective RAG: partly absorbed into Agentic RAG.

The emerging

  • REFRAG: ~30x faster time-to-first-token by compressing chunks into single vectors; still early.
  • ColBERT / ColPali: late-interaction retrieval (multi-vector per doc). Strong on long docs and visual RAG.
  • Cache-Augmented Generation (CAG): practical for stable corpora with frequent queries.

Decision matrix (2026)

| Your situation | Start with |
| --- | --- |
| Prototype, small corpus | Naive + reranker |
| Production Q&A on docs | Hybrid + Contextual + reranker |
| Multi-source, tool use | Agentic RAG |
| Entities, relationships | Graph RAG |
| Images + text | Multimodal RAG |
| Static policies, high QPS | RAG + CAG |
| Latency-critical, long context | REFRAG |
| Queries differ from answers | HyDE as a small step, or upgrade embeddings |

Critical questions

  • Does your problem actually need RAG? (If the answer is in the LLM's training, skip RAG entirely.)
  • Why not always use Agentic RAG? (Cost and latency. Agent loops can 10x your per-query cost.)
  • When do you pick Graph RAG over Hybrid? (When your domain is explicitly relational: biomedical pathways, org charts, legal citation networks.)
  • Why is HyDE less popular in 2026? (Better embeddings and contextual retrieval close the question-answer gap without an extra LLM call.)
  • Contextual Retrieval is 2024. Why did nobody do it earlier? (Cost: needed a cheap enough LLM to run it on every chunk at indexing. Claude Haiku and GPT-4o-mini made it trivial.)

Production pitfalls

  • Over-engineering. You built Graph RAG + Multimodal + Agentic RAG on day one. You cannot debug any of it. Start simple, add complexity with eval-driven pressure.
  • Wrong retrieval layer for the query shape. Vector-only on SKU codes returns garbage. BM25-only on paraphrases returns nothing. Test your query distribution.
  • No eval suite. You cannot compare architectures without a golden set. Build 50-200 Q&A pairs early.
  • Mixing architectures inconsistently. Agentic RAG where the agent sometimes uses graph, sometimes dense, without a clear rule. Debug nightmare.
  • Ignoring the reranker. Across architectures, a good reranker gives 10-30% accuracy gains. Cheapest win.

Alternatives / Comparisons

RAG is one way to inject knowledge. Alternatives and hybrids:

| Approach | Knowledge source | Cost pattern | When |
| --- | --- | --- | --- |
| Prompt engineering | None beyond LLM | Cheap | Small tasks, LLM knows it |
| RAG (any architecture) | External vector DB | Query-time cost | Private or fresh data |
| Fine-tuning (LoRA) | Weights | Training-time | Style, vocab change, not new facts |
| Full fine-tuning | Weights | Big training cost | Rare, replaced by LoRA |
| Tool use via function calling | External APIs per call | Per-call cost | Live data, calculations |
| CAG (KV cache) | Cached prefix | Flat cost after cache | Stable corpus, high QPS |

See rag vs finetuning for the full decision matrix.

Mental parallels (non-AI)

  • Library systems over time:
    • Naive RAG = card catalog by keyword.
    • HyDE = asking a friend to describe what you want, then searching with that description.
    • Graph RAG = Wikipedia hyperlink navigation.
    • Agentic RAG = a research librarian who asks clarifying questions, uses multiple catalogs, and validates sources.
    • Corrective RAG = a peer reviewer.
  • Search evolution: Google started as PageRank (sparse). Added semantic search (vectors). Added knowledge graph. Added snippet generation (like RAG generation step). The whole evolution of search mirrors RAG architectures.

Mini-lab

labs/rag-architectures/ (to create):

  1. Build the same Q&A task with 3 architectures on the same corpus:
    • Naive RAG (baseline)
    • Hybrid search + Contextual Retrieval + Rerank
    • Agentic RAG (LangGraph)
  2. Evaluate on 50 Q&A pairs with RAGAS.
  3. Record cost per query, latency, faithfulness, answer relevance.
  4. Summarize: which pattern wins for your data and at what cost?

Stack: uv, langchain, langgraph, qdrant, cohere, anthropic.

Further reading

Canonical

Related in this KB

Tools

rag · architecture · patterns · hyde · graph-rag · agentic · refrag · cag · hybrid