Fine-tuning
10·Fine-tuning·updated 2026-04-13

LoRA Adapters

TL;DR

Lightweight fine-tuning where you train small low-rank "delta weights" instead of the full model: 100-1000x fewer trainable parameters, a fraction of the GPU memory, and adapters can be loaded at runtime to specialize a generalist model for a given use case.

The historical problem

Classic fine-tuning (full fine-tuning) updates ALL the weights of a model:

  • Llama-7B = 7 billion parameters to update
  • Needs huge GPUs (VRAM ~4x the model size for training)
  • Long, expensive training, and each variant = a new model to store
  • Impossible to have 50 variants for 50 use cases
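The memory arithmetic behind the "~4x" rule of thumb, under one common accounting (fp32 training with Adam, activations ignored; the numbers are illustrative, not measured):

```python
params = 7_000_000_000         # Llama-7B

bytes_fp32 = 4
weights = params * bytes_fp32  # the model itself: 28 GB in fp32
grads   = params * bytes_fp32  # one gradient per parameter
adam_m  = params * bytes_fp32  # Adam first moment
adam_v  = params * bytes_fp32  # Adam second moment

total = weights + grads + adam_m + adam_v
print(total // 10**9)          # 112 GB of VRAM before activations
print(total // weights)        # 4 -> the "~4x model size" rule of thumb
```

Mixed-precision training changes the exact multiple, but the point stands: optimizer state dwarfs the weights themselves, which is what LoRA sidesteps by training only A and B.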

How it works

Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021):

  1. Freeze the base model weights (pretrained)
  2. For each target layer (often the Q and V projections of attention), add two small trainable matrices A and B
  3. The weight delta is A @ B, scaled by α/r, with a very small rank r (typically r = 8, 16, or 32)
  4. During training, only A and B are updated
  5. At inference: W_effective = W_frozen + A @ B

Result: for Llama-7B, a LoRA is typically 10-100 MB (vs 14 GB for the full model).
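The steps above can be sketched in plain NumPy (toy dimensions; `alpha` is the usual LoRA scaling factor, chosen arbitrarily here):

```python
import numpy as np

d_out, d_in, r = 512, 512, 8           # toy layer; real models use d ~ 4096
alpha = 16                             # LoRA scaling; the delta is scaled by alpha / r

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = np.zeros((d_out, r))               # trainable, zero-initialized
B = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init

# Effective weight at inference: frozen base + low-rank delta
W_eff = W + (alpha / r) * (A @ B)

# With one factor at zero, the adapter starts as a no-op (the usual LoRA init)
assert np.allclose(W_eff, W)

# Trainable params: r * (d_out + d_in) vs d_out * d_in for the full matrix
print((d_out * d_in) // (r * (d_out + d_in)))   # 32x fewer for this layer
```

Zero-initializing one factor guarantees training starts exactly from the pretrained model's behavior, which is part of why LoRA is stable.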

Key advantages:

  • Fast and cheap training (a few hours on 1 GPU)
  • Storage: one LoRA per use case (~10 MB each)
  • Hot loading via [[../07-llm-optimization/vllm|vLLM]]: same base model, swap LoRA on the fly
  • Multi-tenant: a "call center" request uses LoRA A, an "internal mails" request uses LoRA B, all on the same server
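The multi-tenant point can be sketched as a registry of adapters selected per request. `registry` and `route` are illustrative names, not an actual vLLM API, and the scalar `delta` stands in for a real (A, B) pair:

```python
# Hypothetical sketch of per-request adapter routing on one frozen base model.
# Real multi-LoRA serving (e.g. vLLM) batches this on-GPU; this only shows
# that the base is shared and only the small per-tenant delta changes.

BASE_WEIGHT = 1.0                     # stands in for the frozen base model

registry = {
    "call-center": 0.5,               # LoRA A's delta
    "internal-mails": -0.25,          # LoRA B's delta
}

def route(tenant: str, x: float) -> float:
    """Serve a request with the tenant's adapter, without copying the base."""
    delta = registry[tenant]          # hot lookup: no model reload
    return (BASE_WEIGHT + delta) * x

print(route("call-center", 10))       # 15.0
print(route("internal-mails", 10))    # 7.5
```

The design point: the expensive object (the base model) is loaded once; adapters are cheap enough to keep many of them resident and pick per request.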

Relevance today (2026)

LoRA has become the standard for fine-tuning in 2024-2026:

  • QLoRA (2023): adds 4-bit quantization of the base model. Allows fine-tuning Llama-70B on a single A100.
  • DoRA (2024): decomposes the delta into magnitude and direction, better results than classic LoRA
  • LoRA+ (2024): different learning rates for A and B, converges faster
  • vLLM + multi-LoRA: now production-ready, stacks like BentoML/Modal serve 100+ adapters on a single endpoint
  • Mistral, Anthropic, OpenAI offer managed LoRA fine-tuning endpoints

Trends to watch:

  • Dynamic LoRA: generate the LoRA on the fly for each user/tenant (expensive but ultra-personalized)
  • MoE + LoRA: combinations on Mixture-of-Experts models
  • ReFT (2024): alternative more efficient than LoRA in some cases

Critical question: does full fine-tuning still make sense in 2026?

  • Rarely. Only if you change the model's fundamental behavior (language, very different domain) or for continual pretraining.
  • For 95% of enterprise use cases, LoRA or QLoRA is enough.

Critical questions

  • Why two matrices A and B rather than a single matrix? What does the low-rank structure buy?
  • How do you choose rank r? Too small = underfit; too large = you lose the benefit
  • Which layers should LoRA apply to? Are Q+V enough, or all of them?
  • QLoRA: why does 4-bit quantization of the base model not break fine-tuning quality?
  • If I fine-tune 10 different LoRAs on 10 tasks, can there be "collisions" when serving all 10 at the same time?
  • How much data do you really need for a good LoRA? (1k, 10k, 100k examples?)
  • LoRA vs prompt engineering / few-shot: when is each the better choice?
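On the rank question specifically, trainable parameters per adapted layer grow linearly in r (A @ B needs r·(d_out + d_in) entries), which is why doubling r stays cheap. A back-of-the-envelope helper (illustrative numbers for a 7B-class hidden size):

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one LoRA'd layer: A is (d_out, r), B is (r, d_in)."""
    return r * (d_out + d_in)

d = 4096                               # typical hidden size for a 7B model
full = d * d                           # 16_777_216 params in the full projection
for r in (8, 16, 32):
    print(r, lora_params(d, d, r), full // lora_params(d, d, r))
# r=8  ->  65_536 params, 256x smaller than the full matrix
# r=16 -> 131_072 params, 128x smaller
# r=32 -> 262_144 params,  64x smaller
```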

Production pitfalls

  • Data quality > quantity: a LoRA on 1000 good examples > 100k noisy ones
  • Catastrophic forgetting: a very specialized LoRA can "break" the model's general capabilities
  • LoRA merge: if you merge several LoRAs into the base model, you lose the runtime swap benefit
  • Large-scale serving: the GPU memory needed to hold 50 simultaneous LoRAs is non-trivial
  • Drift: training data ages; periodic retraining is needed
  • Eval: without a [[../08-evaluations/golden-sets|golden set]], impossible to know if the LoRA really helps
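The merge pitfall is visible in a few lines: folding the delta into the base gives numerically identical weights, but destroys the separation that made runtime swapping possible. A NumPy sketch with toy shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))          # frozen base weight
A = rng.normal(size=(64, 4))           # adapter factors, rank 4
B = rng.normal(size=(4, 64))

# Kept separate: W is intact and can still serve other adapters
W_runtime = W + A @ B

# Merged: the same sum stored as a single matrix (e.g. to shave latency)
W_merged = W + A @ B

# Numerically identical either way...
assert np.allclose(W_runtime, W_merged)

# ...but if you ship only W_merged and drop W, A, B, nothing marks which
# part was the delta: runtime adapter swapping on that checkpoint is gone.
```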

Alternatives / Comparisons

| Approach | When? | Cost | Quality |
| --- | --- | --- | --- |
| Full fine-tuning | Fundamental change (language, domain) | $$$$ | Top |
| LoRA | Specialization per use case | $ | Very good (90-95% of full) |
| QLoRA | Fine-tuning a big model on a limited budget | $ | Good (85-90% of full) |
| DoRA / LoRA+ | When LoRA plateaus | $$ | Better than LoRA |
| Prompt engineering | Cosmetic / structural change | 0 | Variable |
| RAG | Knowledge injection | $$ | Different (factual vs behavioral) |
| RLHF / DPO | Alignment / preferences | $$$ | Top for alignment |

Mini-lab

[[labs/01-lora-finetuning-small-model/]] - fine-tune a LoRA on a small model (Phi-3 or Qwen2-0.5B) with a custom dataset, then load it via vLLM at runtime.

To create: /lab lora-adapters.

Further reading

Tags: fine-tuning · lora · adapters · multi-tenant