01·LLMs·updated 2026-04-21

Post-Training (SFT + RLHF/DPO)

TL;DR

Post-training turns a raw pre-trained language model into something usable. Two main steps: **Supervised Fine-Tuning (SFT)** teaches the model to follow instructions, then **Preference Fine-Tuning (RLHF, DPO)** aligns it with human preferences. Every modern assistant (ChatGPT, Claude, Gemini) goes through this pipeline.

The historical problem

A pre-trained language model is just a "completion machine". Given the prompt "How to make pizza", it may continue with:

  1. "for a family of six?" (adding context)
  2. "What ingredients do I need? How much time?" (adding questions)
  3. Actual instructions on how to make pizza

Only option 3 is useful as an assistant. Pre-training alone does not teach the model that it should answer, be safe, be polite, refuse certain requests, or format responses properly.

In 2020, raw GPT-3 was hard to use directly. OpenAI's breakthrough came in 2022 with InstructGPT (SFT + RLHF), which turned the model into a useful chatbot. ChatGPT was essentially InstructGPT with a UI on top.

Every production LLM since then goes through post-training.

How it works

Step 1: Supervised Fine-Tuning (SFT)

Also called "behavior cloning" or "instruction tuning".

You show the model examples of (prompt, good_response) pairs. The model is trained to produce the response given the prompt. This teaches the shape of useful answers.

This is called demonstration data:

| Prompt | Labeler response |
| --- | --- |
| "Use 'serendipity' in a sentence." | "Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity." |
| "ELI5: cause of the anxiety lump in chest?" | "The lump is caused by muscular tension keeping your glottis dilated..." |

The data must cover the range of tasks you want: QA, summarization, translation, refusals, multi-turn dialogue, etc.
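
In code, SFT reduces to ordinary next-token cross-entropy with the prompt tokens masked out. A minimal sketch of that loss, using toy per-token log-probabilities instead of a real model:

```python
import math

def sft_loss(token_logprobs, prompt_len):
    # Mask the prompt tokens: SFT computes cross-entropy only on the
    # response, so the model learns to produce the answer given the
    # prompt rather than to re-predict the prompt itself.
    response_lps = token_logprobs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Toy sequence: 3 prompt tokens (masked out), 2 response tokens.
lps = [math.log(0.5)] * 3 + [math.log(0.8), math.log(0.4)]
loss = sft_loss(lps, prompt_len=3)
```

Real trainers do the same thing at tensor scale: set the label of every prompt token to an ignore index so it contributes nothing to the gradient.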

Economics from InstructGPT (Huyen):

  • ~13,000 (prompt, response) pairs
  • $10 per pair for expert labelers (90% college-educated)
  • $130,000 just in labeling
  • Plus designing tasks, QA, etc.

This is expensive. Alternatives:

  • LAION volunteer approach (free but biased)
  • DeepMind filtered dialogues from internet data
  • Synthetic data: generate (prompt, response) pairs with a stronger model (now dominant in 2026)

Step 2: Preference Fine-Tuning

After SFT, the model answers, but answers can still be unhelpful, verbose, harmful, or wrong. Preference fine-tuning shapes HOW the model answers.

The question: should the model write an essay claiming one race is inferior? Everyone agrees: no. Should it discuss gun control, abortion, Israel-Palestine? Opinions diverge wildly. Preference fine-tuning is "embedding universal human preference into AI", which Huyen notes is "ambitious, if not impossible".

Two dominant algorithms:

RLHF (original, OpenAI 2022)

Reinforcement Learning from Human Feedback. Two parts:

Part A: train a reward model.

  • Labelers compare pairs: given (prompt, response_A, response_B), which is better?
  • Labels are comparison data (prompt, winning_response, losing_response)
  • Reward model learns to score responses
  • Cost per comparison: ~$3.50 (Llama 2 team), vs $25 for writing a response from scratch
  • Inter-labeler agreement: ~73% (OpenAI)
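
The reward model's training objective on this comparison data is a pairwise (Bradley-Terry style) loss: make the winner's score beat the loser's. A minimal scalar sketch:

```python
import math

def reward_pair_loss(score_win, score_lose):
    # Bradley-Terry pairwise loss: the model should assign a higher
    # score to the winning response. Loss = -log sigmoid(r_w - r_l),
    # which goes to 0 as the margin grows and to log(2) at a tie.
    margin = score_win - score_lose
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Training sums this over all labeled (winner, loser) pairs; only the score *difference* matters, which is why ~73% labeler agreement is still a usable signal.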

Part B: fine-tune the SFT model against the reward model.

  • Sample prompts, generate responses with the SFT model
  • Score each response with the reward model
  • Use PPO (Proximal Policy Optimization) to update the model to maximize scores
  • This is the "reinforcement learning" part
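
The score PPO maximizes is usually not the raw reward-model score but that score minus a KL penalty against the SFT model. A scalar sketch of the shaped reward (the beta value is illustrative):

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    # RLHF training signal: reward-model score minus a KL penalty that
    # keeps the policy close to the SFT model. If beta is too low, the
    # policy can drift into degenerate outputs that game the reward model.
    kl = logp_policy - logp_sft
    return rm_score - beta * kl
```

This beta is the knob behind the "KL penalty too low" pitfall listed later in the note.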

DPO (Direct Preference Optimization, 2023)

Rafailov et al. showed you can skip the reward model entirely. DPO directly trains the model to prefer winning responses over losing ones, using comparison data.
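
A minimal sketch of the DPO loss on a single comparison (the inputs are summed response log-probabilities, shown here as toy scalars; this is not TRL's implementation):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # pi_* are the policy's log-probs of the winning/losing responses,
    # ref_* the frozen reference (SFT) model's. DPO pushes the policy to
    # raise the winner relative to the reference and lower the loser,
    # with no explicit reward model in the loop.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization (policy equals reference) the loss is log(2); gradients then directly reshape the policy's likelihoods, which is why no RL loop is needed.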

Advantages over RLHF:

  • No separate reward model to train
  • No RL loop (simpler implementation)
  • Less compute
  • Less prone to reward hacking

Meta switched from RLHF (Llama 2) to DPO (Llama 3) to reduce complexity.

The reward model detail

The reward model is usually fine-tuned from the same base model as the LLM being trained. Some teams believe the reward model should be at least as strong as the foundation model to score it reliably. Others say a weaker judge can evaluate a stronger model, since "judging is easier than generation" (which Huyen covers more in Chapter 3 on evaluation).

Best-of-N as a shortcut

Some teams (Stitch Fix, Grab) skip the RL loop entirely: they use the reward model to score N samples at inference time and pick the best. Cheaper, simpler, and often good enough. Related to test-time compute.
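
The whole trick fits in a few lines; `generate` and `score` below are stand-ins for a sampling call and a reward-model call:

```python
def best_of_n(prompt, generate, score, n=8):
    # Sample n candidate responses and return the one the reward model
    # scores highest: inference-time search instead of an RL training loop.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with stand-in functions (length as a dummy "reward").
samples = iter(["meh", "good answer", "ok"])
best = best_of_n("some prompt", lambda p: next(samples), score=len, n=3)
```

The trade: N times the inference cost per query, but no PPO instability and no second training run.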

Relevance today (2026)

Huyen's treatment of SFT and RLHF is solid. Updates since Dec 2024:

1. DPO dominates RLHF for most teams

Huyen featured RLHF because it is more flexible. In 2026, DPO and its variants (IPO, KTO, ORPO, SimPO) are the default for most teams. RLHF stays for frontier labs (OpenAI, Anthropic) who want maximum control.

Variants active in 2026:

  • ORPO (Odds Ratio Preference Optimization, 2024): combines SFT and DPO in one step
  • SimPO (2024): no reference model needed, simpler
  • KTO (Kahneman-Tversky Optimization, 2024): does not need paired data, just good/bad labels

2. Synthetic data is the norm

Labeled (prompt, response) pairs at $10 each is 2022 economics. In 2026:

  • Frontier models generate demonstration data for smaller models (distillation)
  • Self-generated preference data (model generates 2 responses, a stronger model judges)
  • Constitutional AI (Anthropic): model critiques and revises its own outputs against a written constitution

Llama 4, Phi-4, DeepSeek-V3 all lean heavily on synthetic post-training data.

3. Reasoning models need different post-training

OpenAI o1 and o3, Claude Opus 4.x thinking mode, and DeepSeek R1 introduced a new post-training recipe focused on reasoning traces:

  • Huge datasets of (problem, chain-of-thought, answer)
  • Reinforcement learning on verified outputs (math, code)
  • Explicit "thinking" token budget during inference

Traditional SFT + DPO is not enough for these models. Full recipe details are guarded; DeepSeek-R1's paper (2025) is the most open window.
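
The "RL on verified outputs" idea can be sketched as a binary outcome reward. `extract_final_answer` is a hypothetical parser; real pipelines verify with code execution, unit tests, or math checkers:

```python
def extract_final_answer(text):
    # Hypothetical parser: take whatever follows the last "Answer:" marker
    # in the model's chain-of-thought output.
    return text.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion, reference):
    # Binary outcome reward: 1.0 if the final answer matches ground truth,
    # else 0.0. Because the check is automatic, RL can run at a scale no
    # human-preference pipeline could match.
    return 1.0 if extract_final_answer(completion) == reference else 0.0
```

This is also where reward hacking bites: the model is rewarded for the final answer, not the reasoning, so traces must be audited separately.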

4. Alignment is a moving target

Anthropic's 2022 finding (more alignment training can decrease alignment) was an early warning. In 2026:

  • RLHF can introduce subtle biases (sycophancy, over-refusal, verbose outputs)
  • Models can learn to game verifiers (reward hacking)
  • "Helpful, harmless, honest" pull in tension
  • Constitutional AI and debate-based alignment are research frontiers

5. Post-training democratized for fine-tuning

In 2022, only big labs could do full RLHF. In 2026:

  • Axolotl, Unsloth, TRL let a solo engineer fine-tune + DPO on a 4090 GPU
  • LoRA + DPO stacks well. You can fine-tune a 70B model's behavior on consumer hardware
  • HuggingFace TRL has implementations of every preference algorithm

Critical questions

  • Why is RLHF more "flexible" than DPO? What specifically can you do with RLHF that DPO cannot?
  • If synthetic data is good enough, why does anyone still pay labelers?
  • Can a weak reward model train a strong policy? (Theoretical issues vs practical results differ.)
  • Preference fine-tuning "embeds human preference" - whose preference? OpenAI labelers in SF? The answer shapes the model's worldview.
  • Is the ultimate goal to eliminate SFT and RLHF by making pre-training better?

Production pitfalls

  • Over-alignment reduces usefulness. Too much preference fine-tuning can make the model refuse anything remotely risky. Test on real user queries before shipping.
  • Reward hacking. Models learn to maximize the reward metric instead of being actually good. Example: output that "looks confident" scores high, even if wrong.
  • Sycophancy. RLHF models tend to agree with the user. Measure this explicitly.
  • Distribution shift. SFT on one distribution (InstructGPT prompts) does not transfer perfectly to your users (enterprise tickets, Torah study questions, etc.). Collect your own eval data.
  • KL penalty too low. Without the KL regularization (keeps the RLHF model close to the SFT model), the model can drift weirdly.
  • Skipping evals. Every post-training step needs a before/after eval set. Without that, you are guessing.

Alternatives / Comparisons

| Approach | Complexity | When to use | Notes |
| --- | --- | --- | --- |
| SFT only | Low | Narrow tasks, low stakes | Fast, cheap, often sufficient |
| SFT + DPO | Medium | Most production assistants in 2026 | Default for new projects |
| SFT + RLHF | High | Frontier labs, maximum control | Harder to stabilize |
| SFT + Constitutional AI | High | Anthropic-style alignment | Only if you have the infra |
| Best-of-N (reward model only) | Low | When inference cost allows | Skip the RL complexity entirely |
| Reasoning post-training (o1 style) | Very high | Reasoning-heavy tasks | Details guarded |

Mini-lab

labs/dpo-on-your-data/ (to create) - use HuggingFace TRL to DPO-fine-tune a small model (Phi-3-mini or Llama 3.2 3B) on a small preference dataset of your own topic (e.g., Torah study Q&A style).

Steps:

  1. Generate 100 (prompt, response_A, response_B) triples
  2. Run DPO training with TRL
  3. Compare before/after on a held-out eval set
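
Step 1's output has to land in the shape TRL's DPOTrainer consumes. A sketch, assuming TRL's standard prompt/chosen/rejected field convention:

```python
def to_dpo_records(triples):
    # Format raw (prompt, better, worse) triples into the
    # {"prompt", "chosen", "rejected"} records that TRL's preference
    # trainers expect; load them with datasets.Dataset.from_list(records).
    return [
        {"prompt": p, "chosen": better, "rejected": worse}
        for p, better, worse in triples
    ]

# Toy example in the note's own domain.
triples = [("What is Shabbat?", "Shabbat is the weekly day of rest.", "idk")]
records = to_dpo_records(triples)
```

Where the "better" response comes from (a stronger model, a judge model, or you) is the real design decision; the formatting is trivial.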

Stack: Unsloth + TRL + LoRA adapters; fits on a single 24GB GPU.

Further reading

  • "Training language models to follow instructions with human feedback" (Ouyang et al., OpenAI, 2022) - InstructGPT, the RLHF reference paper
  • "Direct Preference Optimization" (Rafailov et al., 2023) - DPO paper
  • "Constitutional AI" (Bai et al., Anthropic, 2022) - alternative alignment
  • "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2024) - open reasoning post-training recipe
  • Chip Huyen, Chapter 2 - the version this notion extends
  • HuggingFace TRL docs - for hands-on implementation
Tags: post-training · sft · rlhf · dpo · alignment · finetuning