LoRA Adapters
TL;DR
Lightweight fine-tuning where you train small low-rank "delta weights" instead of the full model: 100-1000x fewer trainable parameters, a fraction of the GPU memory, and adapters that can be loaded at runtime to specialize a generalist model for a use case.
The historical problem
Classic fine-tuning (full fine-tuning) updates ALL the weights of a model:
- Llama-7B = 7 billion parameters to update
- Needs huge GPUs (VRAM ~4x the model size for training)
- Long, expensive training, and each variant = a new model to store
- Impossible to have 50 variants for 50 use cases
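To make the order of magnitude concrete, a back-of-the-envelope comparison, assuming hypothetical Llama-7B-like dimensions (4096x4096 attention projections, 32 layers, LoRA on Q and V only; not exact values):

```python
# Back-of-the-envelope trainable-parameter count, with assumed dimensions.
d = k = 4096            # hypothetical attention projection size
n_layers = 32
targets = 2             # adapt the Q and V projections only
r = 8                   # LoRA rank

full = d * k * n_layers * targets        # full fine-tuning of those layers
lora = r * (d + k) * n_layers * targets  # LoRA: A (d x r) plus B (r x k) per layer

print(full // lora)  # 256 -> ~256x fewer trainable params on the targeted layers
```

Measured against the whole 7B model rather than just the adapted layers, the ratio lands in the 100-1000x range quoted above.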
How it works
Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021):
- Freeze the base model weights (pretrained)
- For each target layer (often the attention Q and V projections), add two small matrices A and B
- The weight delta is `A @ B`, with a very small rank `r` (r = 8, 16, or 32 typically)
- During training, only A and B are updated
- At inference: `W_effective = W_frozen + A @ B`
Result: for Llama-7B, a LoRA adapter is typically 10-100 MB (vs ~14 GB for the full fp16 model).
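A minimal sketch of the mechanism in NumPy (illustrative shapes, no training loop):

```python
import numpy as np

d, k, r = 64, 64, 8                      # layer dims and low rank (illustrative)
rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(d, k))       # pretrained weight, never updated
A = rng.normal(scale=0.01, size=(d, r))  # trainable
B = np.zeros((r, k))                     # trainable; zero-init so the delta starts at 0

def forward(x):
    # W_effective = W_frozen + A @ B, applied without materializing a merged matrix
    return x @ W_frozen.T + x @ (A @ B).T

x = rng.normal(size=(1, k))
# With B = 0 the adapter is a no-op: output equals the frozen layer's output
assert np.allclose(forward(x), x @ W_frozen.T)
```

Zero-initializing one of the two factors is what lets training start exactly from the pretrained model's behavior.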
Key advantages:
- Fast and cheap training (a few hours on 1 GPU)
- Storage: one LoRA per use case (~10 MB each)
- Hot loading via [[../07-llm-optimization/vllm|vLLM]]: same base model, swap LoRA on the fly
- Multi-tenant: a "call center" request uses LoRA A, an "internal mail" request uses LoRA B, all on the same server
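A sketch of this multi-tenant setup using vLLM's OpenAI-compatible server; the adapter names and paths are hypothetical, and flag details may vary across vLLM versions (check the docs linked below):

```shell
# Serve one base model with two hot-loadable adapters (hypothetical paths).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules callcenter=/adapters/callcenter mails=/adapters/mails \
  --max-loras 2

# A request then selects its adapter via the "model" field:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "callcenter", "prompt": "Hello", "max_tokens": 32}'
```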
Relevance today (2026)
LoRA has become the standard for fine-tuning in 2024-2026:
- QLoRA (2023): adds 4-bit quantization of the base model. Allows fine-tuning Llama-70B on a single A100.
- DoRA (2024): decomposes the delta into magnitude and direction, better results than classic LoRA
- LoRA+ (2024): different learning rates for A and B, converges faster
- vLLM + multi-LoRA: now production-ready, stacks like BentoML/Modal serve 100+ adapters on a single endpoint
- Mistral, Anthropic, OpenAI offer managed LoRA fine-tuning endpoints
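The QLoRA memory argument in rough numbers (weights only; activations, KV cache, and the LoRA optimizer state add overhead on top):

```python
params = 70e9                  # Llama-70B parameter count

fp16_gb = params * 2 / 1e9     # 16-bit base weights: ~140 GB, needs multiple GPUs
nf4_gb = params * 0.5 / 1e9    # 4-bit (NF4) frozen base: ~35 GB

print(fp16_gb, nf4_gb)  # 140.0 35.0 -> the 4-bit base fits on one 80 GB A100,
                        # leaving room for the small LoRA params and their gradients
```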
Trends to watch:
- Dynamic LoRA: generate the LoRA on the fly for each user/tenant (expensive but ultra-personalized)
- MoE + LoRA: combinations on Mixture-of-Experts models
- ReFT (2024): alternative more efficient than LoRA in some cases
Critical question: does full fine-tuning still make sense in 2026?
- Rarely. Only if you change the model's fundamental behavior (language, very different domain) or for continual pretraining.
- For 95% of enterprise use cases, LoRA or QLoRA is enough.
Critical questions
- Why two matrices A and B rather than a single delta matrix? What is the point of the low rank?
- How to choose rank `r`? Too small = underfit, too large = you lose the benefit
- Which layers should LoRA apply to? Are Q+V enough, or all of them?
- QLoRA: why does 4-bit quantization of the base model not break fine-tuning quality?
- If I fine-tune 10 different LoRAs on 10 tasks, can there be "collisions" when serving all 10 at the same time?
- How much data do you really need for a good LoRA? (1k, 10k, 100k examples?)
- LoRA vs prompt engineering / few-shot: when is each better?
Production pitfalls
- Data quality > quantity: a LoRA trained on 1,000 good examples beats one trained on 100k noisy ones
- Catastrophic forgetting: a very specialized LoRA can "break" the model's general capabilities
- LoRA merge: if you merge several LoRAs into the base model, you lose the runtime swap benefit
- Large-scale serving: fitting 50 simultaneous LoRAs in GPU memory is non-trivial
- Drift: training data ages; periodic retraining is needed
- Eval: without a [[../08-evaluations/golden-sets|golden set]], impossible to know if the LoRA really helps
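The merge pitfall, sketched in NumPy (same `A @ B` delta notation as the formula above):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))          # base weight
A = rng.normal(size=(8, 2)) * 0.1    # adapter factors, rank 2
B = rng.normal(size=(2, 8)) * 0.1

W_merged = W + A @ B   # adapter baked into the weights: slightly faster inference,
                       # but this checkpoint is now a single-purpose model; you can
                       # no longer swap adapters on it at runtime

# The base is only recoverable if you kept A and B around to subtract back out
assert np.allclose(W_merged - A @ B, W)
```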
Alternatives / Comparisons
| Approach | When? | Cost | Quality |
|---|---|---|---|
| Full fine-tuning | Fundamental change (language, domain) | $$$$ | Top |
| LoRA | Specialization per use case | $ | Very good (90-95% of full) |
| QLoRA | Fine-tuning big model on limited budget | $ | Good (85-90% of full) |
| DoRA / LoRA+ | When LoRA plateaus | $$ | Better than LoRA |
| Prompt engineering | Cosmetic / structural change | 0 | Variable |
| RAG | Knowledge injection | $$ | Different (factual vs behavioral) |
| RLHF / DPO | Alignment / preferences | $$$ | Top for alignment |
Mini-lab
[[labs/01-lora-finetuning-small-model/]] - fine-tune a LoRA on a small model (Phi-3 or Qwen2-0.5B) with a custom dataset, then load it via vLLM at runtime.
To create: /lab lora-adapters.
Further reading
- Original LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- DoRA paper (2024)
- PEFT library (HuggingFace)
- vLLM multi-LoRA docs: https://docs.vllm.ai/en/latest/models/lora.html
- Axolotl / Unsloth: practical frameworks for LoRA training
- CNCF / Google talk on multi-LoRA serving at scale