
5 techniques to make your RAG system actually work

A vanilla RAG retrieves documents and hopes for the best. Here are 5 techniques that production RAG systems use to go from 'it works sometimes' to 'it works reliably'.


You built a RAG system. It retrieves documents, sends them to an LLM, and generates answers. Sometimes the answers are great. Sometimes they are terrible. Sometimes they hallucinate sources that do not exist.

A vanilla RAG is a starting point, not a destination. Here are 5 techniques that turn a fragile prototype into a system you can trust.


1. Chaining: debug each step separately

A vanilla RAG does everything in one shot: retrieve, rank, generate. When the output is bad, you do not know which step failed.

The fix: separate your pipeline into independent steps that you can test and debug individually.

Step 1: Search (Weaviate) -> 20 candidates
Step 2: Rerank (Cohere) -> 5 best, scored 0-1
Step 3: Check relevance (threshold 0.3)
Step 4: Generate answer (Gemini) OR Fallback

Each step has its own input and output. If the answer is bad, you check:

  • Did Step 1 return the right documents? If not, the embedding or the query is the problem.
  • Did Step 2 rank them correctly? If not, the reranker model needs tuning.
  • Did Step 3 let bad documents through? If so, adjust the threshold.
  • Did Step 4 generate a bad answer from good documents? If so, the prompt needs work.

Stanford CS230 (Autumn 2025) calls this "chaining complex prompts." The principle: separate your pipeline so you can identify where you lose quality, instead of treating the whole system as a black box.
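The four stages above can be sketched as plain functions. The helper names, the toy corpus, and the word-overlap scoring below are illustrative stand-ins for Weaviate, Cohere, and Gemini; only the shape of the pipeline, with each stage testable on its own, is the point.

```python
def search(query, k=20):
    # Stage 1: candidate retrieval, stubbed with substring matching
    # over a tiny in-memory corpus.
    corpus = [
        "Shabbat is the weekly day of rest.",
        "Passover commemorates the Exodus from Egypt.",
    ]
    words = query.lower().split()
    return [d for d in corpus if any(w in d.lower() for w in words)][:k]

def rerank(query, docs, top_n=5):
    # Stage 2: score each candidate between 0 and 1, best first.
    # Here: fraction of query words that appear in the document.
    def score(doc):
        q = set(query.lower().split())
        return len(q & set(doc.lower().split())) / max(len(q), 1)
    return sorted(((score(d), d) for d in docs), reverse=True)[:top_n]

def answer(query, threshold=0.3):
    candidates = search(query)          # Stage 1: up to 20 candidates
    ranked = rerank(query, candidates)  # Stage 2: 5 best, scored 0-1
    if not ranked or ranked[0][0] < threshold:
        return "fallback"               # Stage 3: relevance gate failed
    return f"Answer grounded in: {ranked[0][1]}"  # Stage 4: generation (stubbed)
```

Because each stage returns a plain value, you can call `search` or `rerank` directly in a test and see exactly where quality is lost.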


2. Few-shot prompting: show, do not tell

Instead of describing the output format you want, show examples of ideal outputs directly in your prompt.

Here is an example of an ideal response:

Question: What is Shabbat?

### TL;DR
Shabbat is the weekly Jewish day of rest, observed from Friday
evening to Saturday night.

### Sources
- **Exodus 20:8**: "Remember the Sabbath day, to keep it holy."
- **Mishnah Shabbat 7:2**: Lists the 39 categories of prohibited work.

### Explanation
The Torah commands observance of Shabbat in two places...

This works because the LLM learns from patterns. Describing a format in words ("use headers, cite sources") is ambiguous. Showing a concrete example is unambiguous. The LLM copies the structure.

Research from HBS/Wharton (2024) showed that consultants trained in prompt engineering outperformed those who were not. The key difference: the trained group used few-shot examples and structured prompts.
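In code, few-shot prompting is just string assembly: embed the ideal answer from above verbatim and let the model copy its structure. The template wording and variable names here are illustrative, not from a specific framework.

```python
EXAMPLE = '''Question: What is Shabbat?

### TL;DR
Shabbat is the weekly Jewish day of rest, observed from Friday
evening to Saturday night.

### Sources
- **Exodus 20:8**: "Remember the Sabbath day, to keep it holy."
- **Mishnah Shabbat 7:2**: Lists the 39 categories of prohibited work.

### Explanation
The Torah commands observance of Shabbat in two places...'''

def build_prompt(question, context):
    # One concrete example replaces paragraphs of format instructions.
    return (
        "Answer in exactly the format of this example:\n\n"
        f"{EXAMPLE}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```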


3. HyDE: when the question does not look like the answer

A common RAG failure: the user asks "What is Shabbat?" but the relevant documents contain "Remember the Sabbath day to keep it holy" or "The laws of Shabbos." The question vector and the document vector are not close enough because they use different words.

HyDE (Hypothetical Document Embeddings) solves this. Instead of embedding the question directly, you first generate a hypothetical answer, embed that, and search with it.

Question: "What is Shabbat?"
    |
    v
LLM generates a fake answer:
"Shabbat is the Jewish day of rest observed from Friday evening
to Saturday night. It is one of the Ten Commandments..."
    |
    v
Embed the fake answer (not the question)
    |
    v
Search Weaviate with this embedding

The fake answer is closer in vector space to the real documents because it uses similar vocabulary and structure. It does not matter that the fake answer might be wrong - it is only used for retrieval, not for the final response.

This is especially useful for cross-language retrieval: a Hebrew question about Shabbat will generate a Hebrew hypothetical document, which will be closer to Hebrew texts in the database.
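A toy version of HyDE makes the vocabulary effect visible. Here `embed` is a bag-of-words stand-in for a real sentence encoder, `DOCS` stands in for the Weaviate index, and `fake_llm` stands in for the generation call; all names are illustrative.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system uses a sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

DOCS = [
    "Remember the Sabbath day to keep it holy",
    "The laws of Shabbos govern rest on the seventh day",
    "Passover commemorates the Exodus from Egypt",
]

def retrieve(vector, k=2):
    return sorted(DOCS, key=lambda d: cosine(vector, embed(d)), reverse=True)[:k]

def hyde_retrieve(question, generate_hypothetical, k=2):
    fake_answer = generate_hypothetical(question)
    return retrieve(embed(fake_answer), k=k)  # embed the answer, not the question

def fake_llm(question):
    # Stand-in for a real LLM call.
    return "Shabbat the Sabbath is the Jewish day of rest observed on the seventh day"
```

Embedding the raw question "What is Shabbat?" shares no words with any of these documents, so direct retrieval scores zero; the hypothetical answer shares "Sabbath", "rest", "seventh", and "day", and ranks the Shabbat texts on top.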


4. LLM-as-Judge: automated quality evaluation

Manual testing does not scale. You cannot have a human review every response. But traditional metrics (BLEU, ROUGE) do not capture whether a Torah citation is real or hallucinated.

The solution: use another LLM as a judge, with a specific rubric.

Three approaches:

Pairwise comparison: "Here are two answers. Which one is better?" Simple but only tells you relative quality.

Single answer grading: "Rate this answer from 1 to 5." Quick but subjective without a rubric.

Rubric-guided grading: "Here is what a 5/5 looks like. Here is what a 1/5 looks like. Rate this answer."

Example rubric for a Torah study chatbot:

  • 5/5: Cites 2+ verifiable sources with correct references. Explains the concept clearly. Includes halakhic disclaimer when relevant.
  • 3/5: Cites 1 source correctly. Explanation is adequate but shallow. Disclaimer present.
  • 1/5: Invents a source or cites incorrectly. Explanation is misleading. No disclaimer on a halakhic question.

Combine this with automated metrics (Ragas faithfulness, answer relevancy) for comprehensive evaluation. The LLM judge catches quality issues that metrics miss, and metrics catch consistency issues that a judge might overlook.
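A rubric-guided judge is two small pieces: a prompt that embeds the rubric, and defensive parsing of the verdict. The prompt wording and the `parse_score` helper below are illustrative sketches, not the API of a specific evaluation framework.

```python
RUBRIC = """\
5/5: Cites 2+ verifiable sources with correct references. Clear explanation.
     Halakhic disclaimer when relevant.
3/5: Cites 1 source correctly. Adequate but shallow explanation. Disclaimer present.
1/5: Invents or miscites a source. Misleading explanation. No disclaimer."""

def build_judge_prompt(question, answer):
    return (
        "You are grading an answer from a Torah study chatbot.\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Reply with a single integer from 1 to 5."
    )

def parse_score(raw):
    # Judges sometimes wrap the score in prose; pull out the first integer.
    for token in raw.replace("/5", " ").split():
        if token.strip(".:").isdigit():
            return int(token.strip(".:"))
    return None
```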


5. Relevance gating: know when to say "I don't know"

The most important technique for sensitive domains (legal, medical, religious texts): check if the retrieved documents are actually relevant before generating an answer.

After reranking, the top document has a score between 0 and 1. If that score is below a threshold (e.g., 0.3), the documents are not relevant enough. Instead of generating from bad context, trigger a fallback.

if top_score >= 0.3:
    generate_answer(sources)  # Good retrieval
else:
    fallback_response()       # Honest + helpful

The fallback should not be a bare "I don't know." It should be helpful:

  • Acknowledge what was not found
  • Link to where the user can find it
  • Suggest related topics that ARE in the database

In sensitive domains, a wrong source is worse than no source. A fabricated Talmud citation is not just an error - it is an offense. Relevance gating ensures you never generate from irrelevant context.
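A fallback along those lines can be a short helper. `suggest_related` is a hypothetical hook here: any cheap lookup over topics that ARE in the database.

```python
def fallback_response(query, suggest_related):
    # Honest about the miss, but still useful to the user.
    lines = [
        f'I could not find a reliable source for "{query}" in my database.',
        "Try searching a full-text library for the original source.",
    ]
    related = suggest_related(query)  # hypothetical hook into the database
    if related:
        lines.append("Topics I can answer: " + ", ".join(related))
    return "\n".join(lines)
```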


The pattern

These five techniques build on each other:

1. Chain your pipeline -> debug each step
2. Few-shot in prompts -> align output format
3. HyDE for retrieval -> better document matching
4. LLM-as-Judge -> automated quality checks
5. Relevance gating -> never hallucinate from bad context

Start with 1 and 5 (they are the easiest and highest impact). Add 2 when your output format is inconsistent. Add 3 when cross-language or vocabulary mismatch is a problem. Add 4 when you need to scale evaluation beyond manual testing.