
How to handle bad RAG results gracefully

Your RAG system found nothing relevant. Now what? The industry patterns for fallback strategies, relevance thresholds, and honest abstention.

Your user asks "What does Rashi say about Genesis 1:1?" Your RAG system searches the vector database, finds 20 chunks, reranks them, and... none of them are actually about Rashi on Genesis 1:1.

What do you do?

Most RAG tutorials skip this part. They show you the happy path: query goes in, perfect answer comes out. But in production, bad retrieval happens constantly. The question is: how do you fail gracefully?


The problem

A RAG system retrieves documents, then sends them to an LLM as context. If the retrieved documents are not relevant, you are left with two bad outcomes:

  1. The LLM hallucinates - it generates a confident answer from the irrelevant context, or worse, falls back to its training data and invents a source reference
  2. The LLM says nothing useful - "I don't have information about this", which is accurate but unhelpful

Neither is what a good product does.


Pattern 1: Relevance gating with a threshold

This is the most common pattern in production RAG systems. After reranking, check the score of the top result. If it is below a threshold, trigger a different response path.

Cohere Rerank returns a relevance score between 0 and 1 for each document:

import cohere

cohere_client = cohere.Client(api_key="YOUR_API_KEY")

ranked = cohere_client.rerank(
    model="rerank-english-v3.0",
    query="What does Rashi say about Genesis 1:1?",
    documents=retrieved_chunks,  # chunks from the vector DB
    top_n=5,
)

# Check the best result
top_score = ranked.results[0].relevance_score

if top_score >= 0.3:
    # Good retrieval: generate answer from sources
    generate_answer(ranked.results)
else:
    # Bad retrieval: fallback path
    fallback_response()

The threshold (0.3 in this example) depends on your domain. For general knowledge, 0.2 might be enough. For religious texts where accuracy is critical, 0.4-0.5 might be safer. You calibrate it by testing real queries and checking where the "relevant vs not relevant" boundary falls.

Real numbers from our Torah Study AI project:

  • "What is the Shema prayer?" (text exists in our database) - score: 0.833
  • "Rashi on Deuteronomy 1:1" (text not indexed yet) - score: 0.023

The gap is massive. A threshold of 0.3 cleanly separates the two cases.
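With a small labeled set of queries, calibration can be automated: pick the threshold that sits in the middle of the margin between relevant and irrelevant top scores. A minimal sketch, using the two scores measured above plus two illustrative ones; the helper and the labels are not part of any library:

```python
def calibrate_threshold(labeled_scores):
    """labeled_scores: list of (top_score, is_relevant) pairs.

    Returns the midpoint between the lowest relevant score and the
    highest irrelevant score - a simple margin-splitting threshold."""
    relevant = [s for s, ok in labeled_scores if ok]
    irrelevant = [s for s, ok in labeled_scores if not ok]
    lowest_hit = min(relevant)
    highest_miss = max(irrelevant)
    if lowest_hit <= highest_miss:
        raise ValueError("classes overlap; no clean threshold exists")
    return (lowest_hit + highest_miss) / 2

scores = [
    (0.833, True),   # "What is the Shema prayer?" (indexed)
    (0.610, True),   # illustrative second relevant query
    (0.023, False),  # "Rashi on Deuteronomy 1:1" (not indexed)
    (0.110, False),  # illustrative second irrelevant query
]
print(round(calibrate_threshold(scores), 3))
```

If the two classes overlap in your test set, no single threshold is clean, and that is a signal to improve retrieval before tuning the gate.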


Pattern 2: Smart fallback (don't just say "not found")

When retrieval fails, the worst response is "I don't know." A better response helps the user find what they need.

Here is what we built for Torah Study AI. When the rerank score is below the threshold:

  1. Acknowledge honestly - "I don't have this text in my library yet"
  2. Provide a direct link - build a URL to where the text actually lives (in our case, sefaria.org)
  3. Suggest related topics - show what you DO have that is closest to their question

FALLBACK_PROMPT = """The user asked a question but the relevant sources
were not found in the library.

Your job is to:
1. Acknowledge that you don't have the exact text
2. Build a direct link to where they can find it
3. Suggest 3 related topics from the sources you DO have
"""

The user gets a response like:

I don't have Rashi's commentary on Bereishit 1:1 in my library yet. You can find it here: https://www.sefaria.org/Rashi_on_Genesis.1.1

In the meantime, I have other Rashi commentaries we could explore:

  • Rashi on Deuteronomy 6:5 (the Shema)
  • Rashi on Deuteronomy 11:14 (rain in its time)

This is the difference between a dead end and a helpful guide.
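The "direct link" step is mechanical once you know the URL scheme: in the Sefaria example above, spaces become underscores and "chapter:verse" becomes "chapter.verse". A simplified sketch; the `sefaria_url` helper is ours, and it assumes the reference already uses Sefaria's English book naming:

```python
def sefaria_url(ref: str) -> str:
    """Build a sefaria.org URL from a reference like 'Rashi on Genesis 1:1'.

    Simplified mapping: the book part gets underscores instead of spaces,
    and the trailing 'chapter:verse' becomes dot-separated."""
    book, _, verse = ref.rpartition(" ")
    path = book.replace(" ", "_") + "." + verse.replace(":", ".")
    return "https://www.sefaria.org/" + path

print(sefaria_url("Rashi on Genesis 1:1"))
# https://www.sefaria.org/Rashi_on_Genesis.1.1
```

Real references have edge cases (single-chapter books, ranges), so a production version should validate against an index of known titles.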


Pattern 3: Corrective RAG (CRAG)

Sometimes the retrieval fails not because the text is missing, but because the query is badly formulated. This is especially common with:

  • Transliteration - "Bereshit" vs "Bereishit" vs "Genesis" vs "Beresheet"
  • Ambiguous terms - "Shabbos" vs "Shabbat" vs "Sabbath"
  • Vague questions - "Tell me about Moses" could match thousands of texts
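For the first two failure modes, the cheapest fix is often query normalization with an alias table, applied before retrieval. A sketch; the mapping below is illustrative, not exhaustive:

```python
# Map known transliteration variants to one canonical form
# before the query hits the vector DB.
ALIASES = {
    "bereshit": "Genesis",
    "bereishit": "Genesis",
    "beresheet": "Genesis",
    "shabbos": "Shabbat",
    "sabbath": "Shabbat",
}

def normalize_query(query: str) -> str:
    """Replace known transliteration variants with canonical forms."""
    words = [ALIASES.get(word.lower(), word) for word in query.split()]
    return " ".join(words)

print(normalize_query("Rashi on Bereshit 1:1"))
# Rashi on Genesis 1:1
```

This handles the variants you already know about; CRAG handles the ones you don't.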

Corrective RAG (from the CRAG paper by Yan et al., 2024) adds a retry step:

Query -> Retrieve -> Grade relevance -> NOT RELEVANT?
    -> Reformulate query -> Retrieve again -> Grade -> Generate

The LLM reformulates the query before the second attempt. "Bereshit 1:1" might become "Genesis 1:1" or "Rashi on the first verse of the Torah."

This is more complex to implement but solves a real problem, especially for multilingual or domain-specific content.
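The retry loop itself is small. A minimal sketch, assuming `retrieve_and_rerank` returns a (top_score, results) pair and `reformulate` asks an LLM to rewrite the query - both are hypothetical stand-ins for your own pipeline pieces:

```python
def answer_with_crag(query, retrieve_and_rerank, reformulate,
                     threshold=0.3, max_retries=1):
    """Corrective retry: grade the retrieval, and if the top score is
    below the threshold, reformulate the query and try again."""
    for attempt in range(max_retries + 1):
        score, results = retrieve_and_rerank(query)
        if score >= threshold:
            return ("answer", results)   # good retrieval: generate from sources
        if attempt < max_retries:
            query = reformulate(query)   # e.g. "Bereshit 1:1" -> "Genesis 1:1"
    return ("fallback", None)            # still bad: honest fallback path
```

Capping `max_retries` at one or two matters: each retry adds an LLM call and retrieval round-trip, and a query that fails twice usually needs the fallback path, not a third rewrite.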


Pattern 4: Never fall back to LLM knowledge on sensitive data

This deserves its own section because it is the most important decision.

When retrieval fails, the tempting solution is: "Just let the LLM answer from its training data." The LLM probably knows what Rashi says about Genesis 1:1. Why not use it?

Because on sensitive data, a wrong source is worse than no source. If your RAG system is about:

  • Religious texts - a made-up Talmud citation is an offense, not just an error
  • Legal documents - a hallucinated clause could cause real harm
  • Medical information - wrong information could be dangerous

In these domains, honest abstention ("I don't have this source") is always better than a confident guess.

Anthropic's own RAG guidance recommends instructing the model to say "I don't have enough information" rather than guess. Sefaria's search returns empty results rather than guessing. Enterprise RAG products like Glean show "no results found" rather than hallucinating.


What good products do

What each product does when retrieval fails:

  • Perplexity - reformulates the query, retries from different sources
  • Glean - shows "no relevant results found", suggests alternative search terms
  • Sefaria - returns empty results, suggests browsing by category
  • Notion AI - states it can't find relevant workspace content, flags when using general knowledge

The pattern is clear: the best products are honest about what they don't know and helpful about where to look next.


In practice

Here is the decision tree for your RAG pipeline:

Retrieve 20 chunks from vector DB
        |
Rerank with Cohere (keep top 5)
        |
Top score >= threshold?
    |           |
   YES          NO
    |           |
Generate     Try query reformulation (CRAG)
answer          |
with         Re-retrieve and re-rank
sources         |
             Still below threshold?
                |           |
               YES          NO
                |           |
            Fallback:    Generate
            - honest     answer
              message    with
            - direct     sources
              link
            - suggest
              related
              topics

Start with the simple threshold. Add CRAG when you see queries failing due to transliteration or ambiguity. Never fall back to LLM general knowledge on sensitive domains.