RAG vs Fine-Tuning vs Prompt Engineering (decision matrix)
Watch or read first
- Daily Dose DS, "Prompting vs. RAG vs. Finetuning?" and "Full-model Fine-tuning vs. LoRA vs. RAG" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- OpenAI Cookbook, "When to fine-tune": https://cookbook.openai.com/
- Hugging Face PEFT docs for LoRA: https://huggingface.co/docs/peft/
TL;DR
Three techniques adapt an LLM to your task: prompt engineering (instructions), RAG (external knowledge), and fine-tuning (weights). They answer different questions. Pick by two axes: how much NEW KNOWLEDGE you need, and how much BEHAVIOR CHANGE you need. Hybrid RAG + fine-tune when you need both.
The historical problem
When you ship an LLM app, the base model is almost never enough. Early teams defaulted to fine-tuning because they had no other tool, burning GPU time on small vocabulary adjustments or knowledge additions that did not stick.
The insight from the 2023-2024 wave: most "my LLM does not know X" problems are solved with prompt engineering or RAG, not fine-tuning. Fine-tuning is for BEHAVIOR, not knowledge.
Getting this wrong is expensive. A team that fine-tunes to "teach the model about their product" loses weeks and gets worse results than a half-day RAG build.
How it works: the two axes
Daily Dose DS frames the choice along two axes:
- External knowledge needed: does the task require facts the base LLM does not have?
- Adaptation needed: does the task require changing the model's style, tone, vocabulary, or behavior?
                        External knowledge
                       LOW             HIGH
               +---------------+---------------+
          HIGH |   Fine-tune   |     RAG +     |
               |               |   Fine-tune   |
Adaptation     +---------------+---------------+
           LOW |    Prompt     |      RAG      |
               |  engineering  |               |
               +---------------+---------------+
Detailed decision logic
Use Prompt Engineering if:
- The model already knows the facts
- You do not need a special style
- You want to move fast and cheap
Use RAG if:
- You have a knowledge base the LLM was not trained on (private docs, recent data, niche domain)
- The model's style and vocabulary are fine as-is
- You can tolerate retrieval latency
Use Fine-tuning (LoRA) if:
- You need consistent format, tone, or JSON structure beyond what prompting gives
- You have domain jargon the base model mangles (medical, legal, company-internal vocab)
- You have labeled examples (500-5000)
- You cannot achieve this reliably via prompting alone
Use RAG + Fine-tuning if:
- You need BOTH new facts AND behavior change
- Example: a support bot that knows your product docs (RAG) AND speaks in your brand voice with your escalation protocols (fine-tune)
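The two-axis logic above can be condensed into a tiny helper. This is a hypothetical sketch of the decision matrix, not code from the source:

```python
def choose_technique(needs_external_knowledge: bool, needs_adaptation: bool) -> str:
    """Map the two axes of the decision matrix to a starting technique."""
    if needs_external_knowledge and needs_adaptation:
        return "RAG + fine-tune"
    if needs_external_knowledge:
        return "RAG"
    if needs_adaptation:
        return "fine-tune (LoRA)"
    return "prompt engineering"

# Support bot: needs product docs (knowledge) AND brand voice (behavior).
print(choose_technique(True, True))   # RAG + fine-tune
# Q&A over private docs, default style is fine:
print(choose_technique(True, False))  # RAG
```

"Starting technique" is deliberate: the sections below refine this with cost, latency, and data-volume considerations.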
How it works: the three techniques
1. Full fine-tuning
Adjust all weights of the pre-trained model on a new dataset. Daily Dose DS notes this was the historical default, but it is problematic for LLMs:
- Size: updating 7B-70B parameters is massive
- Cost: compute for retraining is extreme
- Ops: you maintain a full copy of the model per fine-tune
Rarely used in 2026 except by foundation model providers. Dead for app teams.
2. LoRA fine-tuning
Freeze the base model and add small low-rank adapter matrices alongside its weight matrices; train only the adapters.
Original network:
x --> [big weight matrix W, 4096x4096] --> y
LoRA:
x --> [frozen W] + [A (4096x8) @ B (8x4096)] --> y
                    trainable, ~65K params vs ~16.8M
Benefits:
- Train millions of params instead of billions
- Adapter is small (< 100 MB), easy to ship
- Stack multiple LoRAs, swap at inference
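The parameter savings in the diagram are simple arithmetic to verify, assuming the 4096x4096 layer and rank-8 adapters shown above:

```python
d, r = 4096, 8                  # hidden size and LoRA rank

full_params = d * d             # dense weight matrix W
lora_params = d * r + r * d     # adapter matrices A (d x r) and B (r x d)

print(f"full: {full_params:,}")   # full: 16,777,216
print(f"lora: {lora_params:,}")   # lora: 65,536
print(f"{full_params // lora_params}x fewer trainable params")  # 256x fewer trainable params
```

The ratio scales as d / (2r), so lower ranks on bigger layers save even more.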
See lora for the deep dive.
3. RAG
Augment the LLM with retrieved context at inference time. Covered extensively in rag workflow.
Steps 1-2: Embed your documents and store the vectors in a vector DB (one-time).
Step 3: Embed the user query with the same embedding model.
Steps 4-5: Find the query's nearest neighbors in the vector DB.
Steps 6-7: Send the original query plus the retrieved documents to the LLM.
No weight updates. Changes what the LLM knows PER CALL.
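Steps 3-5 reduce to a nearest-neighbor search. A stdlib-only sketch with toy 3-dimensional embeddings (a real system would use a proper embedding model and a vector DB like qdrant):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector DB": pre-embedded documents (steps 1-2 would populate this).
docs = {
    "refund policy":    [0.9, 0.1, 0.0],
    "api rate limits":  [0.1, 0.9, 0.2],
    "brand guidelines": [0.0, 0.2, 0.9],
}

query_embedding = [0.85, 0.15, 0.05]  # step 3: embed the user query

# Steps 4-5: rank documents by similarity to the query, keep the top match.
best = max(docs, key=lambda name: cosine(query_embedding, docs[name]))
print(best)  # refund policy
# Steps 6-7 would prepend the retrieved text to the prompt sent to the LLM.
```

The document names, vectors, and top-1 retrieval are illustrative; production RAG retrieves top-k and typically adds reranking.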
Relevance today (2026)
The 2026 rule of thumb
"Try prompt engineering first. Add RAG when knowledge is missing. Fine-tune only when behavior is wrong."
In that order. Skipping to fine-tuning is usually wrong.
Structured outputs moved a lot of cases out of fine-tuning
In 2023, teams fine-tuned to get consistent JSON. In 2026, structured outputs with grammar-constrained decoding solves this without training. One big reason NOT to fine-tune.
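To make the point concrete: what grammar-constrained decoding guarantees at generation time, teams previously approximated with fine-tuning plus validation code like the sketch below (hypothetical ticket schema, stdlib-only):

```python
import json

# Hypothetical schema a 2023 team might have fine-tuned a model to emit.
REQUIRED = {"category": str, "priority": int, "summary": str}

def validate(raw: str) -> dict:
    """Parse model output and enforce the expected keys and types."""
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

ok = validate('{"category": "billing", "priority": 2, "summary": "double charge"}')
print(ok["category"])  # billing
```

With structured outputs, the provider enforces the schema during decoding, so malformed JSON never reaches this code path at all.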
LoRA + QLoRA make fine-tuning easier
If you do need to fine-tune, LoRA or QLoRA (quantized LoRA) on a consumer GPU works. Daily Dose DS's point about fine-tuning being "hard" is less true than it was. But it is still harder than RAG.
Reasoning models reduced the fine-tuning space again
Models like o1, Opus 4.5 thinking, DeepSeek-R1 reason better on complex tasks without domain fine-tuning. Many "I need to fine-tune for reasoning" use cases are now "use a reasoning model".
Hybrid is mainstream
Serious products combine:
- Prompt engineering with structured outputs
- RAG with hybrid search + reranking + contextual retrieval
- A small LoRA for brand voice or domain vocabulary
- Evals and guardrails around all of it
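The four layers above compose into a single call path. A skeleton with stubbed retrieval and model calls (all names and strings are hypothetical, not a real API):

```python
import json

def retrieve(query: str) -> list[str]:
    """Stub for hybrid search + reranking; returns top document chunks."""
    return ["Refunds are processed within 5 business days."]

def call_llm(system: str, prompt: str) -> str:
    """Stub for the model (base + brand-voice LoRA, structured outputs on)."""
    return json.dumps({"answer": "Refunds take up to 5 business days.", "cited": True})

def answer(query: str) -> dict:
    context = "\n".join(retrieve(query))                        # RAG layer
    system = "You are AcmeBot. Answer only from the context."   # prompt engineering
    raw = call_llm(system, f"Context:\n{context}\n\nQ: {query}")
    reply = json.loads(raw)                                     # structured output
    if not isinstance(reply.get("cited"), bool):                # guardrail / eval hook
        raise ValueError("model reply failed guardrail check")
    return reply

print(answer("How long do refunds take?")["answer"])
```

Each layer can be swapped or removed independently, which is exactly why hybrid stacks beat betting everything on one technique.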
Anti-patterns to avoid
- Fine-tuning for knowledge: "We trained GPT on our docs". This rarely sticks. The model forgets, or over-memorizes and hallucinates near the edges. Use RAG.
- Fine-tuning before trying RAG: classic 2023 mistake. Try RAG first.
- Over-tuning for style: you can get 80% of brand voice from the system prompt. Fine-tune only for the last 20% if you can measure it.
- Retraining from scratch: almost never the answer. Even frontier labs rarely do this.
Critical questions
- Why doesn't fine-tuning work for new facts? (The model memorizes them statistically, diluting prior knowledge. It also hallucinates confidently around the memorized facts. RAG gives cited, grounded facts.)
- When is full fine-tuning worth it over LoRA? (Almost never for app teams. Research teams do it to align new capabilities.)
- Why does Daily Dose DS mention that RAG is bad for summarizing entire corpora? (Retrieval only fetches top matches, so the LLM never sees the whole corpus in one call. For summarization, you need map-reduce or a big-context model, not RAG.)
- Can you fine-tune a small model to replace RAG? (Sometimes, on stable niche domains. Usually not: the model's knowledge goes stale, and the retraining cycle is costly.)
- What if your domain has rapidly evolving facts? (RAG. Fine-tuning freezes yesterday.)
- What if latency is the primary constraint? (Fine-tuning beats RAG at inference. A well-trained small model with no retrieval is faster. Weigh against freshness.)
Production pitfalls
- Thinking "I need to fine-tune" before trying prompts and RAG. Most teams never need to fine-tune.
- Fine-tuning on 50 examples. Not enough. 500-5000 is the realistic floor.
- Forgetting eval. Without a baseline, you cannot tell if fine-tuning helped. Always compare.
- Over-fitting to your training set. Evaluate on held-out data.
- Leaving RAG off when you fine-tune. RAG and fine-tuning are complementary, not alternatives.
- Assuming LoRA merges well with base model upgrades. A LoRA trained on Llama 3.1 will not work on Llama 4. Budget for re-training when the base changes.
- Expecting fine-tuning to fix hallucinations on new facts. It does not. The model just memorizes wrong answers with confidence.
Alternatives / Comparisons
Daily Dose DS gives a four-way split. In 2026, the real picture:
| Need | Best first approach |
|---|---|
| Model does not know a fact | RAG |
| Model knows but gives vague answer | Better prompt, few-shot examples |
| Output format inconsistent | Structured outputs |
| Model mangles domain terms | Glossary in prompt, then LoRA if needed |
| Model's tone is wrong | System prompt, then LoRA |
| Reasoning is weak | Reasoning model or CoT prompting |
| Data is sensitive, cannot leave org | Self-hosted LoRA + RAG |
| New facts added daily | RAG (no fine-tune can keep up) |
| Latency < 200ms required | Small fine-tuned model, no retrieval |
| Agentic behavior needed | ReAct prompt + tool calling, not fine-tune |
Mental parallels (non-AI)
- Hiring vs training vs giving instructions:
- Prompt engineering = giving a well-trained employee a clear task brief
- RAG = giving the employee access to the company's document library
- Fine-tuning = sending the employee through a training program to adopt company culture and vocabulary
- RAG + Fine-tune = both: trained employee with access to all docs
- Operating system analogy:
- Prompt = runtime arguments
- RAG = reading files during execution
- Fine-tune = recompiling the binary
- Library with human staff:
- Staff general knowledge = LLM weights
- Staff consulting a book for a specific question = RAG
- Staff going to a 3-month course in a specialty = fine-tune
Mini-lab
labs/rag-vs-finetuning/ (to create):
- Pick a domain: legal contracts Q&A over 100 docs.
- Implement four pipelines:
- Prompt engineering only (no context beyond the prompt)
- RAG over the 100 docs
- Small LoRA fine-tune on 500 Q&A pairs
- Hybrid RAG + LoRA
- Evaluate on 50 held-out questions:
- Accuracy
- Hallucination rate (judge LLM)
- Cost per query
- Latency
- Rank. Write up when each wins.
Stack: uv, langchain, qdrant, peft for LoRA, anthropic or a self-hosted Llama.
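The final ranking step can be sketched as a sort over per-pipeline metrics. The numbers below are toy placeholders, not real results:

```python
# Toy per-pipeline results over the 50 held-out questions (illustrative only).
results = {
    "prompt-only": {"accuracy": 0.55, "halluc_rate": 0.20, "cost_usd": 0.001, "latency_s": 0.8},
    "rag":         {"accuracy": 0.82, "halluc_rate": 0.05, "cost_usd": 0.004, "latency_s": 1.6},
    "lora":        {"accuracy": 0.70, "halluc_rate": 0.12, "cost_usd": 0.002, "latency_s": 0.9},
    "rag+lora":    {"accuracy": 0.88, "halluc_rate": 0.04, "cost_usd": 0.005, "latency_s": 1.7},
}

# Rank by accuracy, breaking ties with the lower hallucination rate.
ranking = sorted(results, key=lambda p: (-results[p]["accuracy"], results[p]["halluc_rate"]))
for pipeline in ranking:
    print(pipeline, results[pipeline])
```

Swap the sort key for cost or latency to see the ranking flip, which is the whole point of the write-up: each pipeline wins under a different constraint.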
Further reading
Canonical
- Daily Dose DS, "Prompting vs. RAG vs. Finetuning?" and "Full-model Fine-tuning vs. LoRA vs. RAG" (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- Karpathy, "Fine-tuning is not the answer most of the time" (tweets, essays): https://x.com/karpathy
- OpenAI Cookbook, "When to fine-tune": https://cookbook.openai.com/articles/techniques_to_improve_reliability
Related in this KB
Tools
- Hugging Face PEFT (LoRA): https://github.com/huggingface/peft
- OpenAI fine-tuning API: https://platform.openai.com/docs/guides/fine-tuning
- Unsloth (fast LoRA): https://github.com/unslothai/unsloth