LoRA Adapters
TL;DR
Lightweight fine-tuning where you train small low-rank "delta weights" instead of the full model: 100-1000x fewer trainable parameters, a fraction of the GPU memory, and adapters that can be loaded at runtime to specialize a generalist model for a use case.
The historical problem
Classic fine-tuning (full fine-tuning) updates ALL the weights of a model:
- Llama-7B = 7 billion parameters to update
- Needs huge GPUs (VRAM ~4x the model size for training)
- Long, expensive training, and each variant = a new model to store
- Impossible to have 50 variants for 50 use cases
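To make the order of magnitude concrete, a back-of-the-envelope comparison, assuming hypothetical Llama-7B-like dimensions (4096x4096 attention projections, 32 layers, LoRA on Q and V only; not exact values):

```python
# Back-of-the-envelope trainable-parameter count, with assumed dimensions.
d = k = 4096            # hypothetical attention projection size
n_layers = 32
targets = 2             # adapt the Q and V projections only
r = 8                   # LoRA rank

full = d * k * n_layers * targets        # full fine-tuning of those layers
lora = r * (d + k) * n_layers * targets  # LoRA: A (d x r) plus B (r x k) per layer

print(full // lora)  # 256 -> ~256x fewer trainable params on the targeted layers
```

Measured against the whole 7B model rather than just the adapted layers, the ratio lands in the 100-1000x range quoted above.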
How it works
Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021):
- Freeze the base model weights (pretrained)
- For each target layer (often the attention Q and V projections), add two small matrices A and B
- The weight delta is `A @ B`, with a very small rank `r` (r = 8, 16, or 32 typically)
- During training, only A and B are updated
- At inference: `W_effective = W_frozen + A @ B`
Result: for Llama-7B, a LoRA adapter is typically 10-100 MB (vs ~14 GB for the full fp16 model).
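A minimal sketch of the mechanism in NumPy (illustrative shapes, no training loop):

```python
import numpy as np

d, k, r = 64, 64, 8                      # layer dims and low rank (illustrative)
rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(d, k))       # pretrained weight, never updated
A = rng.normal(scale=0.01, size=(d, r))  # trainable
B = np.zeros((r, k))                     # trainable; zero-init so the delta starts at 0

def forward(x):
    # W_effective = W_frozen + A @ B, applied without materializing a merged matrix
    return x @ W_frozen.T + x @ (A @ B).T

x = rng.normal(size=(1, k))
# With B = 0 the adapter is a no-op: output equals the frozen layer's output
assert np.allclose(forward(x), x @ W_frozen.T)
```

Zero-initializing one of the two factors is what lets training start exactly from the pretrained model's behavior.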
Key advantages:
- Fast and cheap training (a few hours on 1 GPU)
- Storage: one LoRA per use case (~10 MB each)
- Hot loading via [[../07-llm-optimization/vllm|vLLM]]: same base model, swap LoRA on the fly
- Multi-tenant: a "call center" request uses LoRA A, an "internal mail" request uses LoRA B, all on the same server
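A sketch of this multi-tenant setup using vLLM's OpenAI-compatible server; the adapter names and paths are hypothetical, and flag details may vary across vLLM versions (check the docs linked below):

```shell
# Serve one base model with two hot-loadable adapters (hypothetical paths).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules callcenter=/adapters/callcenter mails=/adapters/mails \
  --max-loras 2

# A request then selects its adapter via the "model" field:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "callcenter", "prompt": "Hello", "max_tokens": 32}'
```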
Relevance today (2026)
LoRA has become the standard for fine-tuning in 2024-2026:
- QLoRA (2023): adds 4-bit quantization of the base model. Allows fine-tuning Llama-70B on a single A100.
- DoRA (2024): decomposes the delta into magnitude and direction, better results than classic LoRA
- LoRA+ (2024): different learning rates for A and B, converges faster
- vLLM + multi-LoRA: now production-ready, stacks like BentoML/Modal serve 100+ adapters on a single endpoint
- Mistral, Anthropic, OpenAI offer managed LoRA fine-tuning endpoints
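The QLoRA memory argument in rough numbers (weights only; activations, KV cache, and the LoRA optimizer state add overhead on top):

```python
params = 70e9                  # Llama-70B parameter count

fp16_gb = params * 2 / 1e9     # 16-bit base weights: ~140 GB, needs multiple GPUs
nf4_gb = params * 0.5 / 1e9    # 4-bit (NF4) frozen base: ~35 GB

print(fp16_gb, nf4_gb)  # 140.0 35.0 -> the 4-bit base fits on one 80 GB A100,
                        # leaving room for the small LoRA params and their gradients
```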
Trends to watch:
- Dynamic LoRA: generate the LoRA on the fly for each user/tenant (expensive but ultra-personalized)
- MoE + LoRA: combinations on Mixture-of-Experts models
- ReFT (2024): alternative more efficient than LoRA in some cases
Critical question: does full fine-tuning still make sense in 2026?
- Rarely. Only if you change the model's fundamental behavior (language, very different domain) or for continual pretraining.
- For 95% of enterprise use cases, LoRA or QLoRA is enough.
Critical questions
- Why two matrices A and B rather than a single delta matrix? What is the point of the low rank?
- How to choose rank `r`? Too small = underfit, too large = you lose the benefit
- Which layers should LoRA apply to? Are Q+V enough, or all of them?
- QLoRA: why does 4-bit quantization of the base model not break fine-tuning quality?
- If I fine-tune 10 different LoRAs on 10 tasks, can there be "collisions" when serving all 10 at the same time?
- How much data do you really need for a good LoRA? (1k, 10k, 100k examples?)
- LoRA vs prompt engineering / few-shot: when is each better?
Production pitfalls
- Data quality > quantity: a LoRA trained on 1,000 good examples beats one trained on 100k noisy ones
- Catastrophic forgetting: a very specialized LoRA can "break" the model's general capabilities
- LoRA merge: if you merge several LoRAs into the base model, you lose the runtime swap benefit
- Large-scale serving: fitting 50 simultaneous LoRAs in GPU memory is non-trivial
- Drift: training data ages; periodic retraining is needed
- Eval: without a [[../08-evaluations/golden-sets|golden set]], impossible to know if the LoRA really helps
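The merge pitfall, sketched in NumPy (same `A @ B` delta notation as the formula above):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))          # base weight
A = rng.normal(size=(8, 2)) * 0.1    # adapter factors, rank 2
B = rng.normal(size=(2, 8)) * 0.1

W_merged = W + A @ B   # adapter baked into the weights: slightly faster inference,
                       # but this checkpoint is now a single-purpose model; you can
                       # no longer swap adapters on it at runtime

# The base is only recoverable if you kept A and B around to subtract back out
assert np.allclose(W_merged - A @ B, W)
```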
Alternatives / Comparisons
| Approach | When? | Cost | Quality |
|---|---|---|---|
| Full fine-tuning | Fundamental change (language, domain) | $$$$ | Top |
| LoRA | Specialization per use case | $ | Very good (90-95% of full) |
| QLoRA | Fine-tuning big model on limited budget | $ | Good (85-90% of full) |
| DoRA / LoRA+ | When LoRA plateaus | $$ | Better than LoRA |
| Prompt engineering | Cosmetic / structural change | 0 | Variable |
| RAG | Knowledge injection | $$ | Different (factual vs behavioral) |
| RLHF / DPO | Alignment / preferences | $$$ | Top for alignment |
Mini-lab
[[labs/01-lora-finetuning-small-model/]] - fine-tune a LoRA on a small model (Phi-3 or Qwen2-0.5B) with a custom dataset, then load it via vLLM at runtime.
To create: /lab lora-adapters.
Further reading
- Original LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- DoRA paper (2024)
- PEFT library (HuggingFace)
- vLLM multi-LoRA docs: https://docs.vllm.ai/en/latest/models/lora.html
- Axolotl / Unsloth: practical frameworks for LoRA training
- CNCF / Google talk on multi-LoRA serving at scale