03·9 notions

RAG

From naive RAG to production: embeddings, chunking, vector stores, hybrid search, reranking.

RAG Workflow (end-to-end pipeline)

RAG (Retrieval-Augmented Generation) wires an LLM to a private knowledge source. The standard pipeline is 8 steps: chunk, embed, store, query, embed query, retrieve top-k, rerank, generate. Each step is a failure point. Mastering RAG means mastering all eight.
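
The eight steps above can be sketched end to end in plain Python. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the `rerank` function is a simple term-overlap heuristic, and `generate` only builds the prompt that would be sent to an LLM. All function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text, size=12):
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Steps 2 and 5: toy bag-of-words vector (assumption: replaces a real model)."""
    return Counter(tokens(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(store, query_vec, k=3):
    """Step 6: return the k chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank(query, candidates):
    """Step 7: toy reranker that orders candidates by shared query terms."""
    query_terms = set(tokens(query))
    return sorted(candidates, key=lambda c: len(query_terms & set(tokens(c))), reverse=True)


def generate(query, context):
    """Step 8: build the prompt; a real system would send it to an LLM."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query


def rag(documents, query, k=3, keep=1):
    # Steps 1-3: chunk, embed, and store every document.
    store = [(c, embed(c)) for doc in documents for c in chunk(doc)]
    # Steps 4-5: take the user query and embed it with the same function.
    query_vec = embed(query)
    # Steps 6-8: retrieve, rerank, and generate.
    best = rerank(query, retrieve(store, query_vec, k))[:keep]
    return generate(query, best)
```

Calling `rag(docs, "What is the capital of France?")` on a small list of strings returns a prompt whose context holds only the best-matching chunk. Each function maps to one potential failure point, which makes it easier to test the steps in isolation.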

Vector Databases

A vector database stores unstructured data (text, image, audio) as high-dimensional vectors called embeddings. It supports fast approximate nearest neighbor (ANN) search over millions to billions of vectors. In RAG, the vector DB is the LLM's external memory: the place where the LLM can "look up" information it was not trained on.

Chunking

Chunking splits a text into blocks that are each embedded individually in a RAG pipeline. Chunk size and overlap are critical parameters that affect retrieval precision, cost, and latency.
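
A minimal sketch of fixed-size chunking with overlap, assuming sizes are measured in whitespace-separated words (real systems often count tokens or split on sentence boundaries instead). The name `chunk_text` and its default values are illustrative, not taken from any library.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Larger chunks mean fewer vectors to store and search but dilute each embedding; more overlap reduces the chance that an answer is cut in half at a boundary, at the price of more chunks to embed.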

Vector Stores

A specialized database that stores embeddings (vectors of numbers) and queries them by similarity. It is the central building block of a RAG system.
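
To make "query by similarity" concrete, here is a minimal in-memory sketch. It does an exact brute-force cosine scan, which is an assumption made for clarity: production vector stores typically rely on approximate nearest neighbor indexes rather than scanning every vector. The class name `InMemoryVectorStore` and its methods are hypothetical.

```python
import math


class InMemoryVectorStore:
    """Toy vector store: exact cosine search over a Python list."""

    def __init__(self):
        self._items = []   # list of (item_id, unit_vector, payload)
        self._dim = None   # dimensionality fixed by the first inserted vector

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vector]

    def _check_dim(self, vector):
        if self._dim is None:
            self._dim = len(vector)
        elif len(vector) != self._dim:
            raise ValueError("expected dimension %d, got %d" % (self._dim, len(vector)))

    def add(self, item_id, vector, payload=None):
        """Store a vector (normalized once, so queries reduce to dot products)."""
        self._check_dim(vector)
        self._items.append((item_id, self._normalize(vector), payload))

    def query(self, vector, k=3):
        """Return up to k (item_id, cosine_score, payload) tuples, best first."""
        self._check_dim(vector)
        q = self._normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id, payload)
            for item_id, v, payload in self._items
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(item_id, score, payload) for score, item_id, payload in scored[:k]]
```

The payload slot is where a RAG system would keep the original chunk text and metadata, so that a hit can be turned back into context for the LLM.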

RAG Architectures (the 8 main patterns)

RAG is not one architecture but a family. Daily Dose DS lists 8 common patterns: Naive, Multimodal, HyDE, Corrective, Graph, Hybrid, Adaptive, Agentic. Each fixes a specific failure of Naive RAG. Pick by analyzing your data shape, query shape, and latency budget.

HyDE (Hypothetical Document Embeddings)

A question and its answer are often far apart in embedding space. "How does attention work?" does not look like a paragraph describing attention. HyDE addresses this by asking the LLM to write a hypothetical answer first (it may be factually wrong), then embedding that hypothetical answer instead of the question and searching with it. Retrieval quality often improves. Cost: one extra LLM call per query.
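
The difference between embedding the question and embedding a hypothetical answer can be shown with a small sketch. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, and `fake_llm` returns a canned passage where a real system would call an LLM. The two-document corpus is invented for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Assumption: bag-of-words counts in place of a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_vec, corpus, embed_fn, k=1):
    """Rank documents by similarity to an already-embedded query."""
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed_fn(doc)), reverse=True)
    return ranked[:k]


def naive_search(question, corpus, embed_fn, k=1):
    """Baseline: embed the question itself."""
    return search(embed_fn(question), corpus, embed_fn, k)


def hyde_search(question, corpus, llm_fn, embed_fn, k=1):
    """HyDE: embed an LLM-written hypothetical answer instead of the question."""
    hypothetical = llm_fn("Write a short passage that answers: " + question)
    return search(embed_fn(hypothetical), corpus, embed_fn, k)


def fake_llm(prompt):
    """Hypothetical LLM stub returning a plausible (unverified) passage."""
    return "Attention uses softmax over query key dot products to weight value vectors."


CORPUS = [
    "Attention computes weighted sums of value vectors using softmax over query key dot products.",
    "How does a bill become law? How does it work in congress?",
]
```

With this corpus, `naive_search("How does attention work?", CORPUS, toy_embed)` is pulled toward the second document because it shares question words, while `hyde_search` lands on the passage that actually describes attention. The effect is exaggerated by the toy embedding, but the mechanism is the same one HyDE exploits.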

Agentic RAG

Agentic RAG replaces the fixed "retrieve once, generate once" pipeline with an agent that decides WHEN to retrieve, FROM WHICH source, HOW MANY TIMES, and whether the answer is good enough. It turns RAG from a static pipeline into a reasoning loop, paying compute for better accuracy on complex queries.
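
The reasoning loop can be sketched as a small control structure. This is only a skeleton under stated assumptions: `choose_source`, `draft_answer`, and `is_good_enough` are hypothetical callables that a real system would back with LLM calls, and `sources` is an assumed mapping from source names to retrieval functions.

```python
def agentic_rag(question, sources, choose_source, draft_answer, is_good_enough, max_steps=3):
    """Loop: decide whether and where to retrieve, draft, then self-check.

    Returns (answer, evidence, steps_used).
    """
    evidence = []
    answer = None
    steps_used = 0
    for _ in range(max_steps):  # HOW MANY TIMES: bounded by max_steps
        steps_used += 1
        # WHEN and FROM WHICH source: the agent may return None to skip retrieval.
        source_name = choose_source(question, evidence)
        if source_name is not None:
            evidence.extend(sources[source_name](question))
        answer = draft_answer(question, evidence)
        # Good enough? If so, stop paying for more retrieval and generation.
        if is_good_enough(question, answer, evidence):
            break
    return answer, evidence, steps_used
```

The `max_steps` cap is the design choice that keeps the extra compute bounded: a simple query can exit after one pass, while a complex one may consult several sources before the self-check passes.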

REFRAG and CAG (Cache-Augmented Generation)

REFRAG and CAG both attack a core RAG inefficiency: the LLM wastes compute on irrelevant or repeatedly seen context. REFRAG (Meta, 2025) compresses each chunk into a single embedding and expands only the chunks judged relevant, with a reported speedup of about 30x in time-to-first-token. CAG preloads stable knowledge into the model's KV cache so it is not re-fetched or re-processed on every query. In practice, combine them: use CAG for stable data and RAG (optionally with REFRAG) for volatile data.

RAG vs Fine-Tuning vs Prompt Engineering (decision matrix)

Three techniques adapt an LLM to your task: prompt engineering (instructions), RAG (external knowledge), and fine-tuning (weights). They answer different questions. Pick along two axes: how much NEW KNOWLEDGE you need, and how much BEHAVIOR CHANGE you need. Combine RAG and fine-tuning when you need both.
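
The two-axis choice can be written down as a tiny lookup. This sketch simplifies each axis to a yes/no flag, which is an assumption: the text describes "how much" of each is needed, so a real decision involves degrees rather than booleans. The function name `choose_adaptation` is made up for the example.

```python
def choose_adaptation(needs_new_knowledge, needs_behavior_change):
    """Map the two axes to a technique (simplified to boolean inputs)."""
    if needs_new_knowledge and needs_behavior_change:
        return "RAG + fine-tuning"
    if needs_new_knowledge:
        return "RAG"
    if needs_behavior_change:
        return "fine-tuning"
    # Neither new knowledge nor new behavior: instructions alone may suffice.
    return "prompt engineering"
```

For instance, a support bot that must cite an internal wiki but can keep the model's default tone falls in the "RAG" cell, while a model that must adopt a strict output style without any new facts falls in the "fine-tuning" cell.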