RAG
03·RAG·updated 2026-04-19

HyDE (Hypothetical Document Embeddings)

TL;DR

A question and its answer are not semantically similar. "How does attention work?" does not look like a paragraph describing attention. HyDE fixes this by asking the LLM to hallucinate a fake answer first, then embeds the fake answer (not the question) to search. Retrieval quality often jumps. Cost: one extra LLM call per query.

The historical problem

In Naive RAG, you embed the user's query and search for chunks whose embeddings are close. But a question is written very differently from an answer:

Query: "What causes the greenhouse effect?"
Answer chunk: "Greenhouse gases like CO2 and methane trap infrared radiation
               from Earth's surface, warming the atmosphere. The main sources
               are fossil fuels and agriculture."

A short interrogative sentence has few content words. A paragraph answering it has many. In embedding space, they land in different neighborhoods. Many irrelevant texts with similar interrogative style can score higher than the real answer.

This is the question-answer asymmetry problem. HyDE (Gao et al., Dec 2022) proposes a clever workaround: transform the question into something shaped like an answer before searching.

How it works

The 5-step HyDE pipeline

1. Query Q arrives.
2. Ask an LLM: "Write a paragraph that answers: Q."
   Output: a Hypothetical answer H (may contain hallucinations).
3. Embed H with your bi-encoder (e.g., Contriever) to get vector E.
4. Use E to search the vector DB -> top-k chunks C.
5. Send Q + C to the LLM for the final grounded answer.

Why hallucinations don't hurt

Step 2 produces an answer that may be factually wrong. That seems bad. But it does not matter, because:

  • The bi-encoder was trained with contrastive learning to pull texts on similar topics together in embedding space, regardless of factual accuracy.
  • It acts as a lossy compressor: the embedding keeps the topical signature and discards the hallucinated specifics.
  • The resulting embedding E lands close to the real answer chunks in the vector DB.

So HyDE leverages the LLM as a "query expander" in answer-shape, and the retrieval step filters out the hallucinated content.

Visual intuition

Embedding space (simplified 2D)

  Question Q --> embed --> E_Q   (lands in "question-shaped" region)

  Real answer chunks   --> embed --> E_chunk_1, E_chunk_2, ...
                                     (land in "answer-shaped" region)

  Hypothetical answer H --> embed --> E_H  (lands in "answer-shaped" region)

  E_H is close to E_chunk_1, E_chunk_2.
  E_Q is not. So retrieval with E_H wins.
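The geometry above can be checked numerically. A minimal sketch with hand-picked 2D vectors (the coordinates are invented for illustration, not real embeddings; imagine questions clustering near one axis and answers near the other):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented coordinates: the "question-shaped" region hugs the x-axis,
# the "answer-shaped" region hugs the y-axis.
E_Q = [0.9, 0.1]      # raw question embedding
E_H = [0.2, 0.9]      # hypothetical answer's embedding
E_chunk = [0.1, 1.0]  # a real answer chunk in the index

# The hypothetical answer lands much closer to the real chunk than the question does.
assert cosine(E_H, E_chunk) > cosine(E_Q, E_chunk)
```

With these numbers E_H scores ~0.99 against the chunk while E_Q scores ~0.21, which is the whole trick in two lines of arithmetic.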

Code sketch

def hyde_retrieve(query: str, k: int = 10) -> list[Chunk]:
    # `llm`, `embed`, `vector_db`, and `Chunk` are assumed to exist elsewhere.
    # Pipeline step 2: hallucinate an answer
    prompt = f"Write a paragraph that answers: {query}"
    hypothetical = llm.complete(prompt)

    # Step 3: embed the fake answer, not the original query
    query_vector = embed(hypothetical)

    # Step 4: search the vector DB with the hypothetical's embedding
    return vector_db.search(query_vector, k=k)

Relevance today (2026)

The gap HyDE closed has shrunk

In 2022, embedding models like text-embedding-ada-002 did a mediocre job bridging question-answer asymmetry. HyDE gave measurable gains.

By 2026:

  • Better embedding models: text-embedding-3-large, voyage-3, BGE-M3, Cohere v3 explicitly train with asymmetric pairs (query-document). Question-answer gap is much smaller.
  • Instruction-tuned embeddings: some models accept a task prefix ("Represent this query for retrieval"), which moves the query into answer space without an LLM call.
  • Contextual Retrieval (Anthropic, 2024): prepends a short context to each chunk before embedding. Different approach, similar effect.
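The task-prefix idea can be shown without any LLM call. The prefix strings below follow the E5 family's convention ("query: " / "passage: "); other models use different strings, so check the model card — the `embed` step itself is assumed to happen elsewhere:

```python
# E5-style asymmetric prefixes: the model was trained so that prefixed
# queries land near the region where prefixed passages live.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def prep_query(q: str) -> str:
    # At search time: prefix the user query before embedding it.
    return QUERY_PREFIX + q

def prep_passage(p: str) -> str:
    # At indexing time: prefix every chunk before embedding it.
    return PASSAGE_PREFIX + p

print(prep_query("What causes the greenhouse effect?"))
# query: What causes the greenhouse effect?
```

One string concatenation buys a chunk of what HyDE buys, with zero extra latency — which is exactly why the gap has shrunk.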

On modern embeddings, raw HyDE often matches or is beaten by "good embedding + reranker". The extra LLM call is not free.

When HyDE still wins

  • You are stuck with an older embedding model (e.g., local BGE-small with limited training)
  • The corpus is very domain-specific (legal, medical) and generic embeddings do not capture it
  • Queries are ultra-short ("GDPR fine") and the embedding model has no context to work with
  • You already have a small, cheap LLM in the pipeline, so adding a HyDE call is negligible

When to skip HyDE

  • Modern embedding models + reranker already work
  • Latency budget is tight
  • The cheap LLM you would use is noisy (small open-source models may hallucinate too hard)
  • Contextual Retrieval at indexing time is an alternative that is pay-once, not per-query

Hybrid patterns

  • HyDE only on ambiguous queries: classify query length/clarity, use HyDE for the hard ones.
  • Query rewriting + HyDE: first paraphrase the query, then HyDE.
  • Multi-HyDE: generate N hypothetical answers, embed each, aggregate rankings.
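Multi-HyDE is the easiest of the three to sketch. Assuming `generate(query)` wraps the LLM and `search(text)` embeds and queries the vector DB (both are stand-ins here), the aggregation itself is just reciprocal rank fusion, which is plain Python:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists of chunk ids."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def multi_hyde(query: str, generate, search, n: int = 3) -> list[str]:
    """Multi-HyDE: n hypothetical answers -> n searches -> one fused ranking.
    `generate` and `search` are assumed helpers (LLM call, vector DB query)."""
    rankings = [search(generate(query)) for _ in range(n)]
    return rrf_merge(rankings)

# Toy sanity check with stubbed rankings: "b" ranks well everywhere, so it wins.
fused = rrf_merge([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
assert fused[0] == "b"
```

RRF is a deliberate choice over averaging the N embeddings: it tolerates one bad hypothetical answer, since a single garbage ranking barely moves the fused scores.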

Daily Dose DS notes that HyDE improves retrieval but comes with "increased latency and more LLM usage". True in 2026. Budget for it.

Critical questions

  • Why does the hypothetical answer not contaminate the final answer? (Because the LLM in step 5 is grounded in the retrieved REAL chunks, not the fake answer.)
  • Why is the contriever a bi-encoder, not a cross-encoder? (Cross-encoders score pairs; bi-encoders produce one vector you can index. HyDE needs indexing.)
  • Could you use a cross-encoder reranker and skip HyDE? (Yes, often cheaper. Modern RAG does this.)
  • What if the LLM refuses to answer (safety filter)? (HyDE fails. Fallback to query embedding.)
  • Does HyDE help on non-English queries? (Depends on the LLM and embedding coverage. Test per-language.)
  • Can you cache hypothetical answers? (Yes, if queries repeat. Classic query-cache pattern applies.)

Production pitfalls

  • Latency doubles. You now have one extra LLM call on the hot path. If your budget is 500ms, HyDE often blows it.
  • Hallucinated content leaks. Make sure step 5 does NOT include the hypothetical answer in the final prompt (only the real chunks).
  • Small LLMs produce garbage. HyDE with a tiny local model can hurt recall. Use at minimum GPT-4o-mini / Claude Haiku quality.
  • Domain mismatch. A general LLM writes a fake answer that is generic, landing far from your domain-specific chunks. Fine-tune the HyDE generator or use an in-domain LLM.
  • Cost creep. Per-query cost = HyDE call + retrieval + generation call. 3x the LLM work vs Naive RAG.
  • No eval. Measure recall@k before and after HyDE on your golden set. In 2026 it often does NOT help. Verify.
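The last pitfall deserves code. A minimal recall@k comparison over a golden set, assuming `golden` maps each query to the set of chunk ids that should be retrieved, and `baseline(q)` / `hyde(q)` are the two retrievers returning ranked id lists (all names here are placeholders):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def compare(golden: dict[str, set[str]], baseline, hyde, k: int = 10) -> tuple[float, float]:
    """Mean recall@k for both retrievers over the golden set."""
    b = sum(recall_at_k(baseline(q), rel, k) for q, rel in golden.items()) / len(golden)
    h = sum(recall_at_k(hyde(q), rel, k) for q, rel in golden.items()) / len(golden)
    return b, h

# Toy sanity check: one of the two relevant chunks appears in the top 10.
assert recall_at_k(["c1", "c2", "c3"], {"c2", "c9"}, k=10) == 0.5
```

Run it once before shipping HyDE and once after; if the second number is not clearly higher, you are paying an extra LLM call for nothing.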

Alternatives / Comparisons

Approach                                  | Cost                           | 2026 effectiveness
HyDE                                      | 1 extra LLM call per query     | Used to be 10-20% recall boost, now often flat
Better embedding model                    | Indexing cost, flat per-query  | Often bigger gain than HyDE
Query expansion (synonyms, paraphrases)   | Fast                           | Small gain
Contextual Retrieval                      | LLM call per chunk at indexing | Big gain, Anthropic reports 49% fewer failures
Hybrid search (BM25 + dense)              | Small                          | Big gain on names, codes
Cross-encoder reranker                    | Fast call on top-k             | Big gain always
Agentic RAG with retrieval self-critique  | Expensive                      | Biggest gain for complex queries

The 2026 take: HyDE is a nice technique to know. It is usually not the first or second thing you reach for. Better embeddings + Contextual Retrieval + Reranker covers most of what HyDE offered, more cheaply and more reliably.

Mental parallels (non-AI)

  • Answering a quiz question: you come up with a plausible answer first, then check it against your memory. HyDE does the same thing.
  • Googling: users often type keyword-shaped queries, not question-shaped ones. You transform your thought into search-engine-shaped language; HyDE transforms the query into retrieval-friendly language automatically.
  • Peer review: when you write a paper, you often draft a hypothetical reviewer's comment to anticipate what they will look for. Similar inversion.

Mini-lab

labs/hyde/ (to create):

  1. Take the MS MARCO or BEIR benchmark (question-passage pairs).
  2. Implement two retrievers:
    • Baseline: embed the query directly
    • HyDE: LLM generates a hypothetical answer, embed that
  3. Use text-embedding-3-small as the embedding model.
  4. Measure recall@10 and NDCG@10 on 500 queries.
  5. Repeat with a cross-encoder reranker on top.
  6. Compute ROI: HyDE gain vs HyDE extra latency and cost.

Stack: uv, openai, sentence-transformers for BGE baseline, beir library.
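Step 4 of the lab needs the metrics. recall@10 is covered above in spirit; NDCG@10 is the one people get wrong. A sketch with binary relevance (1 if the chunk id is in the relevant set, else 0) — the `beir` library computes this for you, but knowing the formula helps debug its output:

```python
import math

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: rewards relevant hits near the top."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # position i is 0-based; the formula is 1-based
        for i, chunk_id in enumerate(retrieved[:k])
        if chunk_id in relevant
    )
    # Ideal DCG: all relevant items packed into the top positions.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# A perfect ranking scores exactly 1.0.
assert ndcg_at_k(["a", "b"], {"a", "b"}) == 1.0
```

Unlike recall@k, NDCG drops when a relevant chunk slips from position 1 to position 9, which is exactly the kind of movement a reranker (step 5) should show.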

Tags: rag · hyde · retrieval · hypothetical · embeddings