RAG Architectures (the 8 main patterns)
Watch or read first
- Daily Dose DS, "8 RAG architectures" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- LlamaIndex docs - good taxonomy and code examples: https://docs.llamaindex.ai/en/stable/
- Pinecone Learning Center - hybrid search and advanced RAG: https://www.pinecone.io/learn/
TL;DR
RAG is not one architecture but a family. Daily Dose DS lists 8 common patterns: Naive, Multimodal, HyDE, Corrective, Graph, Hybrid, Adaptive, Agentic. Each fixes a specific failure of Naive RAG. Pick by analyzing your data shape, query shape, and latency budget.
The historical problem
Naive RAG works for simple fact lookup but breaks on:
- Questions worded very differently from answers (semantic gap)
- Multi-hop queries that need 2+ retrieval steps
- Relationships between entities (graphs, not documents)
- Mixed modalities (text + images + tables)
- Low-quality retrieved chunks that mislead the LLM
- Queries that do not need retrieval at all
Each architecture below addresses one or more of these failures.
How it works: the 8 patterns
1. Naive RAG
query --> embed --> search vector DB --> top-k --> stuff in prompt --> LLM --> answer
Simple vector similarity between query and stored chunks. Works for direct factual Q&A on a homogeneous corpus.
When: MVP, simple knowledge bases, "what is X" queries. Breaks when: queries are complex, the wording mismatches the answer text, or multi-hop reasoning is needed.
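The whole Naive pipeline fits in a few lines. A minimal sketch, with a toy bag-of-words "embedding" standing in for a real embedding model and vector DB (all names here are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """query -> embed -> similarity search -> top-k -> stuff in prompt."""
    q = embed(query)
    top = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
    return "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"

chunks = [
    "The capital of France is Paris.",
    "Python was created by Guido van Rossum.",
    "Paris hosts the Louvre museum.",
]
print(naive_rag_prompt("What is the capital of France?", chunks))
```

Every other pattern in this note is a modification of one of these stages: what gets embedded, what gets searched, what gets stuffed.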
2. Multimodal RAG
Embed and retrieve across text, image, audio, video using models like CLIP, Nomic-Embed-Multimodal, Voyage Multimodal.
text query --> multimodal embed --> search mixed index --> chunks + images --> multimodal LLM --> answer
When: product search with photos, medical imaging + notes, video understanding. Breaks when: your base model cannot reason across modalities well.
3. HyDE (Hypothetical Document Embeddings)
The insight: a question is not semantically similar to its answer. "How does attention work?" does not look like a paragraph describing attention.
HyDE fix: ask the LLM to hallucinate a hypothetical answer first, embed THAT, and search. The hallucinated answer is closer in embedding space to real answers than the raw question.
query --> LLM generates fake answer --> embed fake answer --> search --> real chunks --> LLM --> final answer
See hyde for the deep dive.
When: retrieval recall is low, queries are short questions. Breaks when: hallucinations are too off-topic, latency matters.
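To see why the trick works, here is a toy comparison: bag-of-words cosine stands in for a real embedding model, and the "hypothetical" answer is hardcoded where HyDE would make an LLM call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model.
    clean = text.lower().replace("?", "").replace(".", "").replace(",", "")
    return Counter(clean.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

answer_chunk = ("Attention computes weighted sums of value vectors, "
                "with weights from query-key dot products.")

query = "How does attention work?"
# In real HyDE this is an LLM call ("write a paragraph answering this question"):
hypothetical = ("Attention works by computing weights from query and key vectors "
                "and taking weighted sums of value vectors.")

sim_question = cosine(embed(query), embed(answer_chunk))
sim_hyde = cosine(embed(hypothetical), embed(answer_chunk))
print(f"question -> chunk:     {sim_question:.2f}")
print(f"hypothetical -> chunk: {sim_hyde:.2f}")  # noticeably higher
```

The hallucinated answer shares vocabulary and structure with the real chunk, so it scores far higher than the raw question even in this crude similarity space.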
4. Corrective RAG
Validate retrieved chunks against trusted sources before sending to the LLM.
query --> retrieve --> check relevance --> if bad, re-retrieve or web search --> LLM
Common pattern: a small classifier scores retrieved chunks, below threshold triggers a fallback (web search, different index, human).
When: high-stakes domains (legal, medical, finance). Freshness matters. Breaks when: the classifier is unreliable or web search is noisy.
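A minimal sketch of the score-then-fallback control flow. The retriever, relevance scorer, and web-search fallback are all stand-ins here; in practice the scorer is a small trained classifier or an LLM judge.

```python
def corrective_retrieve(query, retrieve, score, fallback, threshold=0.5):
    """Retrieve, score each chunk for relevance, fall back if nothing passes."""
    chunks = retrieve(query)
    good = [c for c in chunks if score(query, c) >= threshold]
    if not good:  # every retrieved chunk judged irrelevant -> corrective step
        good = fallback(query)
    return good

# Toy wiring: keyword-overlap scorer, canned "web search" fallback.
retrieve = lambda q: ["Stock prices from 2021.", "Old earnings report."]
score = lambda q, c: len(set(q.lower().split()) & set(c.lower().split())) / len(q.split())
fallback = lambda q: [f"[web] fresh result for: {q}"]

print(corrective_retrieve("current Tesla stock price", retrieve, score, fallback))
```

The stale chunks score below threshold, so the query escalates to the fallback instead of misleading the LLM.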
5. Graph RAG
Convert retrieved content into a knowledge graph (entities + relationships). The LLM sees both the raw text AND the graph structure.
documents --> entity + relation extraction --> knowledge graph --> query-guided traversal --> LLM
Microsoft's GraphRAG (2024) is the reference implementation.
When: relational queries ("who worked with X and founded Y?"), complex reasoning over entities. Breaks when: data is not entity-centric, graph construction is expensive.
6. Hybrid RAG
Combines dense vector retrieval with sparse retrieval (BM25) OR with graph retrieval in one pipeline.
query --> dense search (vectors) + sparse search (BM25) --> RRF or weighted fusion --> top-k --> LLM
In 2026 this is the DEFAULT for production RAG, not an exotic option. Most vector DBs have it built in.
When: mixed query types (exact terms + concepts), names and codes matter. Breaks when: the dense/sparse fusion weights are poorly tuned for your query mix (and tuning them is non-trivial).
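Reciprocal Rank Fusion (RRF) is the most common fusion step because it needs no score calibration: it only uses ranks. A self-contained sketch (document IDs are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_concepts", "doc_sku", "doc_misc"]              # vector ranking
sparse = ["doc_sku", "doc_misc", "doc_concepts"]              # BM25 ranking
print(rrf([dense, sparse]))  # doc_sku wins: high in both lists
```

`k=60` is the conventional constant from the original RRF paper; it dampens the advantage of rank-1 hits so one retriever cannot dominate.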
7. Adaptive RAG
Dynamically decides whether a query needs retrieval at all, and how many steps.
query --> classifier -->
simple fact --> direct LLM
single-hop --> Naive RAG
multi-hop --> iterative retrieval with sub-queries
Often implemented as an [[../05-ai-agents/react-pattern|agentic]] system that routes.
When: mixed traffic (some queries need RAG, some do not), cost optimization. Breaks when: the classifier misroutes and the user gets a weak answer.
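A caricature of the routing step, with keyword heuristics standing in for the trained classifier or routing LLM call (route names are made up):

```python
def route(query: str) -> str:
    """Toy router; a real one is a small classifier or a cheap LLM call."""
    q = query.lower()
    multi_hop_cues = ("compare", "both", "and then", "between")
    retrieval_cues = ("our", "policy", "internal", "according to the docs")
    if any(cue in q for cue in multi_hop_cues):
        return "iterative-rag"   # decompose into sub-queries, retrieve per hop
    if any(cue in q for cue in retrieval_cues):
        return "single-hop-rag"  # one retrieval pass is enough
    return "direct-llm"          # general knowledge: skip retrieval entirely

print(route("What is the capital of France?"))
print(route("What does our refund policy say?"))
print(route("Compare our 2023 and 2024 revenue drivers"))
```

The value is in the third branch: most production traffic contains queries that need no retrieval at all, and skipping it saves both latency and cost.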
8. Agentic RAG
The LLM itself is an agent that decides retrieval strategy on the fly: which source, how many times, when to stop.
user query --> agent loop:
plan --> retrieve --> evaluate --> re-plan if needed --> answer
See agentic rag for the deep dive.
When: complex workflows, multi-tool, multi-source (vector DB + SQL + web + APIs). Breaks when: agents loop or cost explodes.
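The loop can be sketched with stubbed tools and a stubbed judge. In a real system the plan, the evaluation, and the answer step are all LLM calls; here the "plan" is simply trying tools in a fixed order.

```python
def agentic_rag(query, tools, judge, answer_with, max_steps=4):
    """Plan -> retrieve -> evaluate -> re-plan loop with a hard step budget."""
    evidence, trace = [], []
    for name, tool in list(tools.items())[:max_steps]:  # naive plan: tools in order
        evidence += tool(query)
        trace.append(name)
        if judge(query, evidence):  # stop as soon as the evidence suffices
            break
    return answer_with(query, evidence), trace

tools = {
    "vector_db": lambda q: [],                       # misses
    "sql":       lambda q: ["revenue 2024: $1.2M"],  # hits
    "web":       lambda q: ["(never reached)"],
}
judge = lambda q, evidence: len(evidence) > 0
answer_with = lambda q, evidence: f"Based on {evidence}: ..."

answer, trace = agentic_rag("2024 revenue?", tools, judge, answer_with)
print(trace)  # stopped after SQL returned evidence; web never called
```

The `max_steps` budget is the guard against the "agents loop or cost explodes" failure mode above; never ship an agent loop without one.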
Bonus architectures (not in Daily Dose DS's list of 8)
REFRAG (Meta, 2025): compress chunks into single vectors; an RL policy selects which to expand. See refrag and cag.
CAG (Cache-Augmented Generation): put stable context in the KV cache, skip retrieval for it. Hybrid RAG+CAG. See refrag and cag.
Contextual Retrieval (Anthropic, 2024): prepend a context summary to each chunk before embedding. Not really a separate architecture, more a chunking upgrade, but gives huge gains.
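The chunk-level transformation in Contextual Retrieval is trivial; the cost is the per-chunk LLM call. A sketch with the summarizer stubbed out (Anthropic's recipe generates the context line with a cheap model plus prompt caching over the full document):

```python
def contextualize(doc_title: str, chunks: list[str], summarize) -> list[str]:
    """Prepend a short, document-aware context line to each chunk before embedding."""
    return [f"{summarize(doc_title, chunk)}\n{chunk}" for chunk in chunks]

# Stub: a real summarizer is an LLM call that sees the whole document.
summarize = lambda title, chunk: f"From '{title}', on quarterly results:"

chunks = ["Revenue grew 3% over the previous quarter."]
print(contextualize("ACME Q2 2024 10-Q", chunks, summarize)[0])
```

Without the prepended line, "the previous quarter" is unanchored and the chunk embeds poorly; with it, company and period become part of the embedded text.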
Relevance today (2026)
The "8 architectures" frame is didactic, not exclusive
Real production systems mix these. A typical 2026 prod RAG is:
- Hybrid (dense + BM25) + Contextual Retrieval + Reranker + Agentic routing for complex queries + CAG for static policy docs.
That's five techniques layered into one system: two from the list of 8, three from beyond it.
The winners of 2024-2026
- Hybrid search: everybody uses it.
- Contextual Retrieval: big wins, low cost.
- Reranking with cross-encoders: standard.
- Agentic RAG: rising fast as reasoning models get cheap.
- Graph RAG: niche, heavy, but strong for the right use case.
The losers
- Naive RAG in prod: a 2023 artifact. If your prod is naive, you have homework.
- HyDE alone: rarely worth the extra LLM call. Often beaten by a better embedding model + reranker.
- Corrective RAG: partly absorbed into Agentic RAG.
The emerging
- REFRAG: compresses retrieved context into chunk vectors, reporting roughly 30x faster time-to-first-token. Still early.
- ColBERT / ColPali: late-interaction retrieval (multi-vector per doc). Strong on long docs and visual RAG.
- Cache-Augmented Generation (CAG): practical for stable corpora with frequent queries.
Decision matrix (2026)
| Your situation | Start with |
|---|---|
| Prototype, small corpus | Naive + reranker |
| Production Q&A on docs | Hybrid + Contextual + reranker |
| Multi-source, tool use | Agentic RAG |
| Entities, relationships | Graph RAG |
| Images + text | Multimodal RAG |
| Static policies, high QPS | RAG + CAG |
| Latency-critical, long context | REFRAG |
| Queries differ from answers | HyDE as a small step, or upgrade embeddings |
Critical questions
- Does your problem actually need RAG? (If the answer is in the LLM's training, skip RAG entirely.)
- Why not always use Agentic RAG? (Cost and latency. Agent loops can 10x your per-query cost.)
- When do you pick Graph RAG over Hybrid? (When your domain is explicitly relational: biomedical pathways, org charts, legal citation networks.)
- Why is HyDE less popular in 2026? (Better embeddings and contextual retrieval close the question-answer gap without an extra LLM call.)
- Contextual Retrieval is 2024. Why did nobody do it earlier? (Cost: needed a cheap enough LLM to run it on every chunk at indexing. Claude Haiku and GPT-4o-mini made it trivial.)
Production pitfalls
- Over-engineering. You built Graph RAG + Multimodal + Agentic RAG on day one. You cannot debug any of it. Start simple, add complexity with eval-driven pressure.
- Wrong retrieval layer for the query shape. Vector-only on SKU codes returns garbage. BM25-only on paraphrases returns nothing. Test your query distribution.
- No eval suite. You cannot compare architectures without a golden set. Build 50-200 Q&A pairs early.
- Mixing architectures inconsistently. Agentic RAG where the agent sometimes uses graph, sometimes dense, without a clear rule. Debug nightmare.
- Ignoring the reranker. Across architectures, a good reranker gives 10-30% accuracy gains. Cheapest win.
Alternatives / Comparisons
RAG is one way to inject knowledge. Alternatives and hybrids:
| Approach | Knowledge source | Cost pattern | When |
|---|---|---|---|
| Prompt engineering | None beyond LLM | Cheap | Small tasks, LLM knows it |
| RAG (any architecture) | External vector DB | Query-time cost | Private or fresh data |
| Fine-tuning (LoRA) | Weights | Training-time | Style, vocab change, not new facts |
| Full fine-tuning | Weights | Big training cost | Rare, replaced by LoRA |
| Tool use via function calling | External APIs per call | Per-call cost | Live data, calculations |
| CAG (KV cache) | Cached prefix | Flat cost after cache | Stable corpus, high QPS |
See rag vs finetuning for the full decision matrix.
Mental parallels (non-AI)
- Library systems over time:
- Naive RAG = card catalog by keyword.
- HyDE = asking a friend to describe what you want, then searching with that description.
- Graph RAG = Wikipedia hyperlink navigation.
- Agentic RAG = a research librarian who asks clarifying questions, uses multiple catalogs, and validates sources.
- Corrective RAG = a peer reviewer.
- Search evolution: Google started as PageRank (sparse). Added semantic search (vectors). Added knowledge graph. Added snippet generation (like RAG generation step). The whole evolution of search mirrors RAG architectures.
Mini-lab
labs/rag-architectures/ (to create):
- Build the same Q&A task with 3 architectures on the same corpus:
- Naive RAG (baseline)
- Hybrid search + Contextual Retrieval + Rerank
- Agentic RAG (LangGraph)
- Evaluate on 50 Q&A pairs with RAGAS.
- Record cost per query, latency, faithfulness, answer relevance.
- Summarize: which pattern wins for your data and at what cost?
Stack: uv, langchain, langgraph, qdrant, cohere, anthropic.
Further reading
Canonical
- Daily Dose DS, "8 RAG architectures" (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- "From Local to Global: A GraphRAG Approach" (Microsoft, 2024) - https://arxiv.org/abs/2404.16130
- Anthropic Contextual Retrieval (2024) - https://www.anthropic.com/news/contextual-retrieval
- "HyDE" paper (Gao et al., 2022) - https://arxiv.org/abs/2212.10496
Related in this KB
Tools
- LangChain: https://python.langchain.com/ , LlamaIndex: https://docs.llamaindex.ai/ , Haystack: https://haystack.deepset.ai/
- Microsoft GraphRAG: https://github.com/microsoft/graphrag
- Anthropic Contextual Retrieval cookbook: https://github.com/anthropics/anthropic-cookbook/tree/main/skills/contextual-embeddings
- RAGAS: https://docs.ragas.io/ , TruLens for eval: https://www.trulens.org/