
How a RAG server works: from question to answer

A RAG server is not magic. It is a pipeline with six clear steps. Here is what happens between the moment you ask a question and the moment you get an answer.

You ask a question. A few seconds later, you get an answer with real sources. What happened in between?

A RAG server is a pipeline. Each step takes the output of the previous one and makes it better. Six steps, from your question to your answer.


Step 1: Turn the question into numbers

Your question is text. The database stores vectors (lists of numbers). To search, you need to convert your question into the same format.

An embedding model reads your question and produces a vector - a list of thousands of numbers that represent the meaning of what you asked. Two questions about the same topic will produce similar vectors, even if they use different words.

"What is Shabbat?" and "Tell me about the Sabbath" produce vectors that are close together. "How to cook pasta?" produces a vector that is far away.

This step takes milliseconds. The same model that embedded your documents (during indexing) now embeds your question. They must be the same model, otherwise the vectors live on different "maps."
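
The "close together / far away" idea is just cosine similarity between vectors. The sketch below uses made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions) purely to show how closeness is measured:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" - the values are invented for illustration only.
shabbat_q1 = [0.9, 0.1, 0.2]   # "What is Shabbat?"
shabbat_q2 = [0.8, 0.2, 0.3]   # "Tell me about the Sabbath"
pasta_q    = [0.1, 0.9, 0.7]   # "How to cook pasta?"

print(cosine_similarity(shabbat_q1, shabbat_q2))  # high: same topic
print(cosine_similarity(shabbat_q1, pasta_q))     # low: different topic
```

With a real embedding model, you would replace the hand-written lists with the model's output; the comparison itself works the same way.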


Step 2: Search the database

The vector database compares your question vector against millions of stored document vectors. It finds the closest ones - the documents whose meaning is most similar to your question.

This is like searching a library, but instead of matching keywords, you match meaning. You ask about "the day of rest" and find documents about "Shabbat" even if they never use the word "rest."

The database returns 20 candidates. Fast, but rough. Some are very relevant. Some are only vaguely related.
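
Conceptually, the search is "score every stored vector against the question, keep the closest." A real vector database uses approximate indexes (HNSW, IVF) to avoid scanning everything, but a brute-force sketch shows the idea; the document ids and vectors here are invented:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_vec, index, k=20):
    """Brute-force nearest-neighbor search: score every stored vector,
    return the ids of the k closest. Fast, but rough - no reranking yet."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

index = {
    "shabbat_laws":   [0.9, 0.1, 0.1],
    "friday_candles": [0.7, 0.3, 0.2],
    "pasta_recipes":  [0.1, 0.9, 0.8],
}
print(search([0.8, 0.2, 0.1], index, k=2))
```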


Step 3: Rerank the results

The initial search was fast because it compared vectors independently. But it never actually "read" each document together with your question.

A reranker does exactly that. It takes your question and each of the 20 candidates, reads them together, and scores how relevant each one really is. Score from 0 (not relevant) to 1 (perfect match).

The reranker catches things the initial search missed. A document about "Shabbat candles" might have been ranked 15th by vector similarity but jumps to 2nd after the reranker reads it with the question "What are the blessings we say on Friday night?"

This step keeps fewer documents but picks better ones. 20 candidates become 5.
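
The control flow of reranking can be sketched like this. A real reranker is a cross-encoder model that reads question and document jointly; the word-overlap scorer below is only a stand-in for that model, to show the shape of the step (many candidates in, a few scored winners out):

```python
def rerank(question: str, candidates: list[str], top_n: int = 5):
    """Score each (question, candidate) pair, return (score, doc) pairs,
    best first. The scorer here is a toy; swap in a cross-encoder model."""
    q_words = set(question.lower().split())

    def score(doc: str) -> float:
        d_words = set(doc.lower().split())
        return len(q_words & d_words) / len(q_words | d_words)

    scored = sorted(((score(d), d) for d in candidates), reverse=True)
    return scored[:top_n]

question = "what are the blessings we say on friday night"
candidates = [
    "history of italian pasta",
    "blessings for the friday night candles",
    "laws of carrying on shabbat",
]
top = rerank(question, candidates, top_n=2)
```

The returned scores matter: the next step uses them to decide whether any result is good enough to send to the LLM.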


Step 4: Check relevance

Before sending anything to the LLM, check: are these results actually relevant?

The reranker gives each document a score. If the best score is below a threshold (for example, 0.3), it means none of the retrieved documents really match the question.

This check is critical. Without it, the LLM receives irrelevant context and generates a bad answer - or worse, hallucinates a confident-sounding response based on documents that have nothing to do with the question.
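
The check itself is a few lines. The 0.3 threshold is the example value from above; in practice you tune it per corpus and per reranker:

```python
FALLBACK_THRESHOLD = 0.3  # example value; tune per corpus and reranker

def passes_relevance_check(reranked: list[tuple[float, str]]) -> bool:
    """reranked: (score, doc_id) pairs from the reranker, best first.
    If even the best score is below the threshold, skip the LLM."""
    return bool(reranked) and reranked[0][0] >= FALLBACK_THRESHOLD

print(passes_relevance_check([(0.82, "shabbat_laws")]))   # True: proceed
print(passes_relevance_check([(0.12, "pasta_recipes")]))  # False: fallback
```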


Step 5: Generate the answer

The top 5 documents are sent to the LLM as context, along with your question. The LLM reads the documents, understands them, and writes an answer.

The LLM does not "know" the answer from its training data. It reads the documents you gave it and synthesizes a response. This is the key difference from a regular chatbot: every claim in the answer can be traced back to a specific document.

A system prompt tells the LLM how to behave: cite your sources, do not invent information, answer in the user's language.

The answer streams back word by word (Server-Sent Events), so the user sees it appear in real time instead of waiting for the full response.
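
Assembling the LLM request mostly means formatting the retrieved documents into the prompt. The sketch below uses the common chat-message format (system/user roles); the exact system prompt wording is an assumption, written to match the behaviors described above:

```python
SYSTEM_PROMPT = (
    "Answer using ONLY the documents provided. "
    "Cite the document id for every claim. "
    "If the documents do not contain the answer, say so. "
    "Answer in the user's language."
)

def build_messages(question: str, documents: list[dict]) -> list[dict]:
    """Assemble the chat messages sent to the LLM: system prompt first,
    then the retrieved documents as labeled context, then the question."""
    context = "\n\n".join(f"[{d['id']}]\n{d['text']}" for d in documents)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]
```

Because each document is labeled with its id in the context, the LLM can cite sources the user can actually check.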


Step 6: Fallback (when retrieval fails)

If step 4 determined that no document is relevant enough, the pipeline skips the LLM entirely. Instead, it returns a helpful fallback:

  • Acknowledge that the text was not found
  • Link to where the user can find it (the original source)
  • Suggest related topics that ARE in the database

This is better than letting the LLM generate from bad context. On sensitive domains (legal, medical, religious texts), a wrong source is worse than no source.
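
A fallback response is just structured honesty. The field names and wording below are illustrative, not a fixed API:

```python
def build_fallback(source_url: str, related_topics: list[str]) -> dict:
    """Honest fallback when no retrieved document clears the threshold:
    admit the text was not found, link the original source, and suggest
    topics that ARE in the database."""
    return {
        "answer": "I could not find this text in the indexed sources.",
        "source": source_url,
        "suggestions": related_topics,
    }
```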


The complete picture

Six steps. Each one is simple on its own. Together, they turn a question into a sourced, verifiable answer.

The quality of your RAG system depends on every step. Bad embeddings mean bad search. No reranking means noisy results. No relevance check means hallucinations. No fallback means dead ends.

Get all six right, and your users get answers they can trust.
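
The six steps above can be wired together in a dozen lines. Each stage is passed in as a function, so the pipeline reads like the article; the stubs in the usage example stand in for real models and a real database:

```python
def answer(question, embed, search, rerank, generate, fallback, threshold=0.3):
    """The six steps of the pipeline, wired together. Every stage is a
    pluggable function; real ones call models and a vector database."""
    query_vec = embed(question)                 # 1. question -> vector
    candidates = search(query_vec, k=20)        # 2. rough top 20 from the DB
    scored = rerank(question, candidates)       # 3. (score, doc), best first
    if not scored or scored[0][0] < threshold:  # 4. relevance check
        return fallback(question)               # 6. honest fallback
    top_docs = [doc for _, doc in scored[:5]]
    return generate(question, top_docs)         # 5. LLM reads the context

# Stub stages, just to show the wiring.
result = answer(
    "What is Shabbat?",
    embed=lambda q: [1.0, 0.0],
    search=lambda v, k=20: ["shabbat_laws", "friday_candles"],
    rerank=lambda q, docs: [(0.9, "shabbat_laws"), (0.6, "friday_candles")],
    generate=lambda q, docs: f"Answer based on {docs}",
    fallback=lambda q: "Text not found; see the original source.",
)
```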