RAG vs Fine-Tuning vs Prompt Engineering (decision matrix)
Watch or read first
- Daily Dose DS, "Prompting vs. RAG vs. Finetuning?" and "Full-model Fine-tuning vs. LoRA vs. RAG" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- OpenAI Cookbook, "When to fine-tune": https://cookbook.openai.com/
- Hugging Face PEFT docs for LoRA: https://huggingface.co/docs/peft/
TL;DR
Three techniques adapt an LLM to your task: prompt engineering (instructions), RAG (external knowledge), and fine-tuning (weights). They answer different questions. Pick by two axes: how much NEW KNOWLEDGE you need, and how much BEHAVIOR CHANGE you need. Hybrid RAG + fine-tune when you need both.
The historical problem
When you ship an LLM app, the base model is almost never enough. Early teams defaulted to fine-tuning because they had no other tool, burning GPU time on small vocabulary adjustments or knowledge additions that did not stick.
The insight from the 2023-2024 wave: most "my LLM does not know X" problems are solved with prompt engineering or RAG, not fine-tuning. Fine-tuning is for BEHAVIOR, not knowledge.
Getting this wrong is expensive. A team that fine-tunes to "teach the model about their product" loses weeks and gets worse results than a half-day RAG build.
How it works: the two axes
Daily Dose DS frames the choice along two axes:
- External knowledge needed: does the task require facts the base LLM does not have?
- Adaptation needed: does the task require changing the model's style, tone, vocabulary, or behavior?
                        External knowledge
                       LOW             HIGH
               +---------------+---------------+
          HIGH |   Fine-tune   |     RAG +     |
               |               |   Fine-tune   |
Adaptation     +---------------+---------------+
           LOW |    Prompt     |      RAG      |
               |  engineering  |               |
               +---------------+---------------+
Detailed decision logic
Use Prompt Engineering if:
- The model already knows the facts
- You do not need a special style
- You want to move fast and cheap
Use RAG if:
- You have a knowledge base the LLM was not trained on (private docs, recent data, niche domain)
- The model's style and vocabulary are fine as-is
- You can tolerate retrieval latency
Use Fine-tuning (LoRA) if:
- You need consistent format, tone, or JSON structure beyond what prompting gives
- You have domain jargon the base model mangles (medical, legal, company-internal vocab)
- You have labeled examples (500-5000)
- You cannot achieve this reliably via prompting alone
Use RAG + Fine-tuning if:
- You need BOTH new facts AND behavior change
- Example: a support bot that knows your product docs (RAG) AND speaks in your brand voice with your escalation protocols (fine-tune)
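The two-axis logic above can be condensed into a tiny helper. This is a hypothetical sketch of the decision matrix, not code from the source:

```python
def choose_technique(needs_external_knowledge: bool, needs_adaptation: bool) -> str:
    """Map the two axes of the decision matrix to a starting technique."""
    if needs_external_knowledge and needs_adaptation:
        return "RAG + fine-tune"
    if needs_external_knowledge:
        return "RAG"
    if needs_adaptation:
        return "fine-tune (LoRA)"
    return "prompt engineering"

# Support bot: needs product docs (knowledge) AND brand voice (behavior).
print(choose_technique(True, True))   # RAG + fine-tune
# Q&A over private docs, default style is fine:
print(choose_technique(True, False))  # RAG
```

"Starting technique" is deliberate: the sections below refine this with cost, latency, and data-volume considerations.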
How it works: the three techniques
1. Full fine-tuning
Adjust all weights of the pre-trained model on a new dataset. Daily Dose DS notes this was the historical default, but it is problematic for LLMs:
- Size: updating 7B-70B parameters is massive
- Cost: compute for retraining is extreme
- Ops: you maintain a full copy of the model per fine-tune
Rarely used in 2026 except by foundation model providers. Dead for app teams.
2. LoRA fine-tuning
Freeze the base model and add small low-rank adapter matrices alongside its weight matrices; train only the adapters.
Original network:
x --> [big weight matrix W, 4096x4096] --> y
LoRA:
x --> [frozen W] + [A (4096x8) @ B (8x4096)] --> y
                    trainable, ~65K params vs ~16.8M
Benefits:
- Train millions of params instead of billions
- Adapter is small (< 100 MB), easy to ship
- Stack multiple LoRAs, swap at inference
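The parameter savings in the diagram are simple arithmetic to verify, assuming the 4096x4096 layer and rank-8 adapters shown above:

```python
d, r = 4096, 8                  # hidden size and LoRA rank

full_params = d * d             # dense weight matrix W
lora_params = d * r + r * d     # adapter matrices A (d x r) and B (r x d)

print(f"full: {full_params:,}")   # full: 16,777,216
print(f"lora: {lora_params:,}")   # lora: 65,536
print(f"{full_params // lora_params}x fewer trainable params")  # 256x fewer trainable params
```

The ratio scales as d / (2r), so lower ranks on bigger layers save even more.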
See lora for the deep dive.
3. RAG
Augment the LLM with retrieved context at inference time. Covered extensively in rag workflow.
Steps 1-2: Embed your documents and store the vectors in a vector DB (one-time).
Step 3: Embed the user query with the same embedding model.
Steps 4-5: Find the query's nearest neighbors in the vector DB.
Steps 6-7: Send the original query plus the retrieved documents to the LLM.
No weight updates. Changes what the LLM knows PER CALL.
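Steps 3-5 reduce to a nearest-neighbor search. A stdlib-only sketch with toy 3-dimensional embeddings (a real system would use a proper embedding model and a vector DB like qdrant):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector DB": pre-embedded documents (steps 1-2 would populate this).
docs = {
    "refund policy":    [0.9, 0.1, 0.0],
    "api rate limits":  [0.1, 0.9, 0.2],
    "brand guidelines": [0.0, 0.2, 0.9],
}

query_embedding = [0.85, 0.15, 0.05]  # step 3: embed the user query

# Steps 4-5: rank documents by similarity to the query, keep the top match.
best = max(docs, key=lambda name: cosine(query_embedding, docs[name]))
print(best)  # refund policy
# Steps 6-7 would prepend the retrieved text to the prompt sent to the LLM.
```

The document names, vectors, and top-1 retrieval are illustrative; production RAG retrieves top-k and typically adds reranking.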
Relevance today (2026)
The 2026 rule of thumb
"Try prompt engineering first. Add RAG when knowledge is missing. Fine-tune only when behavior is wrong."
In that order. Skipping to fine-tuning is usually wrong.
Structured outputs moved a lot of cases out of fine-tuning
In 2023, teams fine-tuned to get consistent JSON. In 2026, structured outputs with grammar-constrained decoding solves this without training. One big reason NOT to fine-tune.
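To make the point concrete: what grammar-constrained decoding guarantees at generation time, teams previously approximated with fine-tuning plus validation code like the sketch below (hypothetical ticket schema, stdlib-only):

```python
import json

# Hypothetical schema a 2023 team might have fine-tuned a model to emit.
REQUIRED = {"category": str, "priority": int, "summary": str}

def validate(raw: str) -> dict:
    """Parse model output and enforce the expected keys and types."""
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

ok = validate('{"category": "billing", "priority": 2, "summary": "double charge"}')
print(ok["category"])  # billing
```

With structured outputs, the provider enforces the schema during decoding, so malformed JSON never reaches this code path at all.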
LoRA + QLoRA make fine-tuning easier
If you do need to fine-tune, LoRA or QLoRA (quantized LoRA) on a consumer GPU works. Daily Dose DS's point about fine-tuning being "hard" is less true than it was. But it is still harder than RAG.
Reasoning models reduced the fine-tuning space again
Models like o1, Opus 4.5 thinking, DeepSeek-R1 reason better on complex tasks without domain fine-tuning. Many "I need to fine-tune for reasoning" use cases are now "use a reasoning model".
Hybrid is mainstream
Serious products combine:
- Prompt engineering with structured outputs
- RAG with hybrid search + reranking + contextual retrieval
- A small LoRA for brand voice or domain vocabulary
- Evals and guardrails around all of it
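The four layers above compose into a single call path. A skeleton with stubbed retrieval and model calls (all names and strings are hypothetical, not a real API):

```python
import json

def retrieve(query: str) -> list[str]:
    """Stub for hybrid search + reranking; returns top document chunks."""
    return ["Refunds are processed within 5 business days."]

def call_llm(system: str, prompt: str) -> str:
    """Stub for the model (base + brand-voice LoRA, structured outputs on)."""
    return json.dumps({"answer": "Refunds take up to 5 business days.", "cited": True})

def answer(query: str) -> dict:
    context = "\n".join(retrieve(query))                        # RAG layer
    system = "You are AcmeBot. Answer only from the context."   # prompt engineering
    raw = call_llm(system, f"Context:\n{context}\n\nQ: {query}")
    reply = json.loads(raw)                                     # structured output
    if not isinstance(reply.get("cited"), bool):                # guardrail / eval hook
        raise ValueError("model reply failed guardrail check")
    return reply

print(answer("How long do refunds take?")["answer"])
```

Each layer can be swapped or removed independently, which is exactly why hybrid stacks beat betting everything on one technique.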
Anti-patterns to avoid
- Fine-tuning for knowledge: "We trained GPT on our docs". This rarely sticks. The model forgets, or over-memorizes and hallucinates near the edges. Use RAG.
- Fine-tuning before trying RAG: classic 2023 mistake. Try RAG first.
- Over-tuning for style: you can get 80% of brand voice from the system prompt. Fine-tune only for the last 20% if you can measure it.
- Retraining from scratch: almost never the answer. Even frontier labs rarely do this.
Critical questions
- Why doesn't fine-tuning work for new facts? (The model memorizes them statistically, diluting prior knowledge. It also hallucinates confidently around the memorized facts. RAG gives cited, grounded facts.)
- When is full fine-tuning worth it over LoRA? (Almost never for app teams. Research teams do it to align new capabilities.)
- Why does Daily Dose DS mention that RAG is bad for summarizing entire corpora? (Retrieval only fetches top matches, so the LLM never sees the whole corpus in one call. For summarization, you need map-reduce or a big-context model, not RAG.)
- Can you fine-tune a small model to replace RAG? (Sometimes, on stable niche domains. Usually not: the model's knowledge goes stale, and the retraining cycle is costly.)
- What if your domain has rapidly evolving facts? (RAG. Fine-tuning freezes yesterday.)
- What if latency is the primary constraint? (Fine-tuning beats RAG at inference. A well-trained small model with no retrieval is faster. Weigh against freshness.)
Production pitfalls
- Thinking "I need to fine-tune" before trying prompts and RAG. Most teams never need to fine-tune.
- Fine-tuning on 50 examples. Not enough. 500-5000 is the realistic floor.
- Forgetting eval. Without a baseline, you cannot tell if fine-tuning helped. Always compare.
- Over-fitting to your training set. Evaluate on held-out data.
- Leaving RAG off when you fine-tune. RAG and fine-tuning are complementary, not alternatives.
- Assuming LoRA merges well with base model upgrades. A LoRA trained on Llama 3.1 will not work on Llama 4. Budget for re-training when the base changes.
- Expecting fine-tuning to fix hallucinations on new facts. It does not. The model just memorizes wrong answers with confidence.
Alternatives / Comparisons
Daily Dose DS gives a four-way split. In 2026, the real picture:
| Need | Best first approach |
|---|---|
| Model does not know a fact | RAG |
| Model knows but gives vague answer | Better prompt, few-shot examples |
| Output format inconsistent | Structured outputs |
| Model mangles domain terms | Glossary in prompt, then LoRA if needed |
| Model's tone is wrong | System prompt, then LoRA |
| Reasoning is weak | Reasoning model or CoT prompting |
| Data is sensitive, cannot leave org | Self-hosted LoRA + RAG |
| New facts added daily | RAG (no fine-tune can keep up) |
| Latency < 200ms required | Small fine-tuned model, no retrieval |
| Agentic behavior needed | ReAct prompt + tool calling, not fine-tune |
Mental parallels (non-AI)
- Hiring vs training vs giving instructions:
- Prompt engineering = giving a well-trained employee a clear task brief
- RAG = giving the employee access to the company's document library
- Fine-tuning = sending the employee through a training program to adopt company culture and vocabulary
- RAG + Fine-tune = both: trained employee with access to all docs
- Operating system analogy:
- Prompt = runtime arguments
- RAG = reading files during execution
- Fine-tune = recompiling the binary
- Library with human staff:
- Staff general knowledge = LLM weights
- Staff consulting a book for a specific question = RAG
- Staff going to a 3-month course in a specialty = fine-tune
Mini-lab
labs/rag-vs-finetuning/ (to create):
- Pick a domain: legal contracts Q&A over 100 docs.
- Implement four pipelines:
- Prompt engineering only (no context beyond the prompt)
- RAG over the 100 docs
- Small LoRA fine-tune on 500 Q&A pairs
- Hybrid RAG + LoRA
- Evaluate on 50 held-out questions:
- Accuracy
- Hallucination rate (judge LLM)
- Cost per query
- Latency
- Rank. Write up when each wins.
Stack: uv, langchain, qdrant, peft for LoRA, anthropic or a self-hosted Llama.
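The final ranking step can be sketched as a sort over per-pipeline metrics. The numbers below are toy placeholders, not real results:

```python
# Toy per-pipeline results over the 50 held-out questions (illustrative only).
results = {
    "prompt-only": {"accuracy": 0.55, "halluc_rate": 0.20, "cost_usd": 0.001, "latency_s": 0.8},
    "rag":         {"accuracy": 0.82, "halluc_rate": 0.05, "cost_usd": 0.004, "latency_s": 1.6},
    "lora":        {"accuracy": 0.70, "halluc_rate": 0.12, "cost_usd": 0.002, "latency_s": 0.9},
    "rag+lora":    {"accuracy": 0.88, "halluc_rate": 0.04, "cost_usd": 0.005, "latency_s": 1.7},
}

# Rank by accuracy, breaking ties with the lower hallucination rate.
ranking = sorted(results, key=lambda p: (-results[p]["accuracy"], results[p]["halluc_rate"]))
for pipeline in ranking:
    print(pipeline, results[pipeline])
```

Swap the sort key for cost or latency to see the ranking flip, which is the whole point of the write-up: each pipeline wins under a different constraint.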
Further reading
Canonical
- Daily Dose DS, "Prompting vs. RAG vs. Finetuning?" and "Full-model Fine-tuning vs. LoRA vs. RAG" (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
- Karpathy, "Fine-tuning is not the answer most of the time" (tweets, essays): https://x.com/karpathy
- OpenAI Cookbook, "When to fine-tune": https://cookbook.openai.com/articles/techniques_to_improve_reliability
Related in this KB
Tools
- Hugging Face PEFT (LoRA): https://github.com/huggingface/peft
- OpenAI fine-tuning API: https://platform.openai.com/docs/guides/fine-tuning
- Unsloth (fast LoRA): https://github.com/unslothai/unsloth