RAG
03·RAG·updated 2026-04-19

RAG vs Fine-Tuning vs Prompt Engineering (decision matrix)

Watch or read first

TL;DR

Three techniques adapt an LLM to your task: prompt engineering (instructions), RAG (external knowledge), and fine-tuning (weights). They answer different questions. Pick by two axes: how much NEW KNOWLEDGE you need, and how much BEHAVIOR CHANGE you need. Hybrid RAG + fine-tune when you need both.

The historical problem

When you ship an LLM app, the base model is almost never enough. Early teams defaulted to fine-tuning because they had no other tool, burning GPU time on small vocabulary adjustments or knowledge additions that did not stick.

The insight from the 2023-2024 wave: most "my LLM does not know X" problems are solved with prompt engineering or RAG, not fine-tuning. Fine-tuning is for BEHAVIOR, not knowledge.

Getting this wrong is expensive. A team that fine-tunes to "teach the model about their product" loses weeks and gets worse results than a half-day RAG build.

How it works: the two axes

Daily Dose DS frames it neatly with two axes:

  • External knowledge needed: does the task require facts the base LLM does not have?
  • Adaptation needed: does the task require changing the model's style, tone, vocabulary, or behavior?
                          External knowledge
                      LOW                HIGH
                 +---------------+---------------+
                 |               |               |
            HIGH |  Fine-tune    |  RAG +        |
                 |               |  Fine-tune    |
  Adaptation     +---------------+---------------+
                 |               |               |
            LOW  |  Prompt       |  RAG          |
                 |  engineering  |               |
                 +---------------+---------------+

Detailed decision logic

Use Prompt Engineering if:

  • The model already knows the facts
  • You do not need a special style
  • You want to move fast and cheap

Use RAG if:

  • You have a knowledge base the LLM was not trained on (private docs, recent data, niche domain)
  • The model's style and vocabulary are fine as-is
  • You can tolerate retrieval latency

Use Fine-tuning (LoRA) if:

  • You need consistent format, tone, or JSON structure beyond what prompting gives
  • You have domain jargon the base model mangles (medical, legal, company-internal vocab)
  • You have labeled examples (500-5000)
  • You cannot achieve this reliably via prompting alone

Use RAG + Fine-tuning if:

  • You need BOTH new facts AND behavior change
  • Example: a support bot that knows your product docs (RAG) AND speaks in your brand voice with your escalation protocols (fine-tune)
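
The branching above can be sketched as a toy helper. This is illustrative only; the function name and boolean inputs are made up for this note, not from any library:

```python
# Toy mapping of the two-axis decision matrix to a starting technique.
def pick_technique(needs_new_knowledge: bool, needs_behavior_change: bool) -> str:
    """Return the first technique to try, given the two axes."""
    if needs_new_knowledge and needs_behavior_change:
        return "RAG + fine-tune"          # both quadrants high
    if needs_new_knowledge:
        return "RAG"                      # facts missing, behavior fine
    if needs_behavior_change:
        return "fine-tune (LoRA)"         # behavior wrong, facts fine
    return "prompt engineering"           # both low: cheapest option first

# The support-bot example: private docs AND brand voice.
print(pick_technique(needs_new_knowledge=True, needs_behavior_change=True))
# → RAG + fine-tune
```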

How it works: the three techniques

1. Full fine-tuning

Adjust all weights of the pre-trained model on a new dataset. Daily Dose DS notes this was historically the default approach but is problematic for LLMs because:

  • Size: updating 7B-70B parameters is massive
  • Cost: compute for retraining is extreme
  • Ops: you maintain a full copy of the model per fine-tune

Rarely used in 2026 except by foundation-model providers. Dead for app teams.

2. LoRA fine-tuning

Decompose the weight matrices into low-rank adapters and train only those. Freeze the base model.

Original network:
  x --> [big weight matrix W, 4096x4096] --> y

LoRA:
  x --> [frozen W] + [A (4096x8) @ B (8x4096)] --> y
                     trainable, ~65K params vs ~16.8M
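
A quick back-of-envelope check of the savings, using the hidden size (4096) and rank (8) from the diagram above:

```python
# Parameter counts for the LoRA decomposition sketched above.
d, r = 4096, 8                     # hidden size, adapter rank

full_params = d * d                # frozen W: 4096 x 4096
lora_params = d * r + r * d        # A (4096x8) plus B (8x4096)

print(full_params)                 # 16777216 (~16.8M)
print(lora_params)                 # 65536 (~65K)
print(full_params // lora_params)  # 256: a d/(2r) reduction in trainables
```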

Benefits:

  • Train millions of params instead of billions
  • Adapter is small (< 100 MB), easy to ship
  • Stack multiple LoRAs, swap at inference

See lora for the deep dive.

3. RAG

Augment the LLM with retrieved context at inference time. Covered extensively in rag workflow.

Step 1-2: Embed additional data into a vector DB (one-time).
Step 3:   Embed the user query with the same embedding model.
Step 4-5: Find the nearest neighbors in the vector DB.
Step 6-7: Provide the original query + retrieved documents to the LLM.

No weight updates. Changes what the LLM knows PER CALL.
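
The steps above can be compressed into a toy sketch. A real pipeline would use an embedding model and a vector DB such as qdrant; here plain token overlap stands in for embedding similarity, and the documents are hypothetical:

```python
# Minimal stand-in for steps 1-7: index, embed query, retrieve, build prompt.
def embed(text: str) -> set[str]:
    return set(text.lower().split())        # toy stand-in for an embedding

docs = [
    "The Pro plan includes priority support and SSO.",
    "Refunds are processed within 14 days of purchase.",
]
index = [(embed(d), d) for d in docs]       # steps 1-2: one-time indexing

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)                        # step 3: embed the query
    scored = sorted(index, key=lambda e: len(q & e[0]), reverse=True)
    return [d for _, d in scored[:k]]       # steps 4-5: nearest neighbors

query = "How long do refunds take?"
context = retrieve(query)[0]
prompt = f"Context: {context}\n\nQuestion: {query}"   # steps 6-7: augment
print(context)   # → Refunds are processed within 14 days of purchase.
```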

Relevance today (2026)

The 2026 rule of thumb

"Try prompt engineering first. Add RAG when knowledge is missing. Fine-tune only when behavior is wrong."

In that order. Skipping to fine-tuning is usually wrong.

Structured outputs moved a lot of cases out of fine-tuning

In 2023, teams fine-tuned to get consistent JSON. In 2026, structured outputs with grammar-constrained decoding solves this without training. One big reason NOT to fine-tune.
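
To see what structured outputs replace: a sketch of the 2023-era workaround, a validate-and-retry loop around the model. `call_llm` is a hypothetical stub standing in for a real API call; grammar-constrained decoding enforces the schema at decode time and makes this loop unnecessary:

```python
import json

def call_llm(prompt: str, attempt: int) -> str:
    # Stub: pretend attempt 0 is malformed prose, attempt 1 is valid JSON.
    return '{"sentiment": "positive"}' if attempt > 0 else "Sure! positive"

def get_json(prompt: str, required_keys: set, retries: int = 3) -> dict:
    """Retry until the model emits parseable JSON with the required keys."""
    for attempt in range(retries):
        raw = call_llm(prompt, attempt)
        try:
            obj = json.loads(raw)
            if required_keys <= obj.keys():
                return obj
        except json.JSONDecodeError:
            continue                      # malformed: burn a retry
    raise ValueError("no valid JSON after retries")

print(get_json("Classify: 'great product'", {"sentiment"}))
# → {'sentiment': 'positive'}
```

Every retry here costs an extra model call; constrained decoding gets it right on the first pass.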

LoRA + QLoRA make fine-tuning easier

If you do need to fine-tune, LoRA or QLoRA (quantized LoRA) on a consumer GPU works. Daily Dose DS's point about fine-tuning being "hard" is less true than it was. But it is still harder than RAG.
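
If you go this route, the adapter is a few lines of config with Hugging Face's peft library. The hyperparameters below (rank, alpha, target modules) are illustrative defaults, not a recommendation; check what fits your base model:

```python
# Config fragment: a typical LoRA setup via peft (hyperparameters illustrative).
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, config)  # wraps the frozen base model
```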

Reasoning models reduced the fine-tuning space again

Models like o1, Opus 4.5 thinking, DeepSeek-R1 reason better on complex tasks without domain fine-tuning. Many "I need to fine-tune for reasoning" use cases are now "use a reasoning model".

Hybrid is mainstream

Serious products combine:

  • Prompt engineering with structured outputs
  • RAG with hybrid search + reranking + contextual retrieval
  • A small LoRA for brand voice or domain vocabulary
  • Evals and guardrails around all of it

Anti-patterns to avoid

  • Fine-tuning for knowledge: "We trained GPT on our docs". This rarely sticks. The model forgets, or over-memorizes and hallucinates near the edges. Use RAG.
  • Fine-tuning before trying RAG: classic 2023 mistake. Try RAG first.
  • Over-tuning for style: you can get 80% of brand voice from the system prompt. Fine-tune only for the last 20% if you can measure it.
  • Retraining from scratch: almost never the answer. Even frontier labs rarely do this.

Critical questions

  • Why doesn't fine-tuning work for new facts? (The model memorizes them statistically, diluting prior knowledge. It also hallucinates confidently around the memorized facts. RAG gives cited, grounded facts.)
  • When is full fine-tuning worth it over LoRA? (Almost never for app teams. Research teams do it to align new capabilities.)
  • Why does Daily Dose DS mention that RAG is bad for summarizing entire corpora? (Retrieval only fetches top matches, so the LLM never sees the whole corpus in one call. For summarization, you need map-reduce or a big-context model, not RAG.)
  • Can you fine-tune a small model to replace RAG? (Sometimes, on stable niche domains. Usually not: the model goes stale, and the retraining cycle hurts.)
  • What if your domain has rapidly evolving facts? (RAG. Fine-tuning freezes yesterday.)
  • What if latency is the primary constraint? (Fine-tuning beats RAG at inference. A well-trained small model with no retrieval is faster. Weigh against freshness.)

Production pitfalls

  • Thinking "I need to fine-tune" before trying prompts and RAG. Most teams never need to fine-tune.
  • Fine-tuning on 50 examples. Not enough. 500-5000 is the realistic floor.
  • Forgetting eval. Without a baseline, you cannot tell if fine-tuning helped. Always compare.
  • Over-fitting to your training set. Evaluate on held-out data.
  • Leaving RAG off when you fine-tune. RAG and fine-tuning are complementary, not alternatives.
  • Assuming LoRA merges well with base model upgrades. A LoRA trained on Llama 3.1 will not work on Llama 4. Budget for re-training when the base changes.
  • Expecting fine-tuning to fix hallucinations on new facts. It does not. The model just memorizes wrong answers with confidence.

Alternatives / Comparisons

Daily Dose DS gives a four-way split. In 2026, the real picture:

  Need                                  Best first approach
  ------------------------------------  ------------------------------------------
  Model does not know a fact            RAG
  Model knows but gives vague answer    Better prompt, few-shot examples
  Output format inconsistent            Structured outputs
  Model mangles domain terms            Glossary in prompt, then LoRA if needed
  Model's tone is wrong                 System prompt, then LoRA
  Reasoning is weak                     Reasoning model or CoT prompting
  Data is sensitive, cannot leave org   Self-hosted LoRA + RAG
  New facts added daily                 RAG (no fine-tune can keep up)
  Latency < 200ms required              Small fine-tuned model, no retrieval
  Agentic behavior needed               ReAct prompt + tool calling, not fine-tune

Mental parallels (non-AI)

  • Hiring vs training vs giving instructions:
    • Prompt engineering = giving a well-trained employee a clear task brief
    • RAG = giving the employee access to the company's document library
    • Fine-tuning = sending the employee through a training program to adopt company culture and vocabulary
    • RAG + Fine-tune = both: trained employee with access to all docs
  • Operating system analogy:
    • Prompt = runtime arguments
    • RAG = reading files during execution
    • Fine-tune = recompiling the binary
  • Library with human staff:
    • Staff general knowledge = LLM weights
    • Staff consulting a book for a specific question = RAG
    • Staff going to a 3-month course in a specialty = fine-tune

Mini-lab

labs/rag-vs-finetuning/ (to create):

  1. Pick a domain: legal contracts Q&A over 100 docs.
  2. Implement four pipelines:
    • Prompt engineering only (no context beyond the prompt)
    • RAG over the 100 docs
    • Small LoRA fine-tune on 500 Q&A pairs
    • Hybrid RAG + LoRA
  3. Evaluate on 50 held-out questions:
    • Accuracy
    • Hallucination rate (judge LLM)
    • Cost per query
    • Latency
  4. Rank. Write up when each wins.

Stack: uv, langchain, qdrant, peft for LoRA, anthropic or a self-hosted Llama.
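
Step 3's comparison can be sketched as a tiny harness. The pipeline, question, and gold answer below are hypothetical stand-ins; a real run would call the four actual systems and add cost and a judge-LLM hallucination check:

```python
import time

def evaluate(pipeline, heldout):
    """Score one pipeline on held-out (question, gold) pairs."""
    correct, latencies = 0, []
    for question, gold in heldout:
        t0 = time.perf_counter()
        answer = pipeline(question)
        latencies.append(time.perf_counter() - t0)
        correct += int(gold.lower() in answer.lower())  # crude containment match
    return {
        "accuracy": correct / len(heldout),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
    }

heldout = [("What governs late fees?", "section 4.2")]       # hypothetical pair
baseline = lambda q: "See section 4.2 of the agreement."     # stub pipeline
print(evaluate(baseline, heldout))
```

Run the same `heldout` set through all four pipelines and rank the resulting dicts; without this shared baseline you cannot tell whether fine-tuning helped.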

Further reading

Canonical

Related in this KB

Tools

rag · finetuning · prompt-engineering · lora · decision-matrix