01·LLMs·updated 2026-04-21

Post-Training (SFT + RLHF/DPO)

TL;DR

Post-training turns a raw pre-trained language model into something usable. Two main steps: **Supervised Fine-Tuning (SFT)** teaches the model to follow instructions, then **Preference Fine-Tuning (RLHF, DPO)** aligns it with human preferences. Every modern assistant (ChatGPT, Claude, Gemini) goes through this pipeline.

The historical problem

A pre-trained language model is just a "completion machine". Given the prompt "How to make pizza", it may continue with:

  1. "for a family of six?" (adding context)
  2. "What ingredients do I need? How much time?" (adding questions)
  3. Actual instructions on how to make pizza

Only option 3 is useful as an assistant. Pre-training alone does not teach the model that it should answer, be safe, be polite, refuse certain requests, or format responses properly.

In 2020, raw GPT-3 was hard to use directly. OpenAI's breakthrough came in 2022 with InstructGPT (SFT + RLHF), which turned the model into a useful chatbot. ChatGPT was essentially InstructGPT with a UI on top.

Every production LLM since then goes through post-training.

How it works

Step 1: Supervised Fine-Tuning (SFT)

Also called "behavior cloning" or "instruction tuning".

You show the model examples of (prompt, good_response) pairs. The model is trained to produce the response given the prompt. This teaches the shape of useful answers.

This is called demonstration data:

| Prompt | Labeler response |
| --- | --- |
| "Use 'serendipity' in a sentence." | "Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity." |
| "ELI5: cause of the anxiety lump in chest?" | "The lump is caused by muscular tension keeping your glottis dilated..." |

The data must cover the range of tasks you want: QA, summarization, translation, refusals, multi-turn dialogue, etc.
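
In code, SFT reduces to ordinary next-token cross-entropy with the prompt tokens masked out. A minimal sketch of that loss, using toy per-token log-probabilities instead of a real model:

```python
import math

def sft_loss(token_logprobs, prompt_len):
    # Mask the prompt tokens: SFT computes cross-entropy only on the
    # response, so the model learns to produce the answer given the
    # prompt rather than to re-predict the prompt itself.
    response_lps = token_logprobs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Toy sequence: 3 prompt tokens (masked out), 2 response tokens.
lps = [math.log(0.5)] * 3 + [math.log(0.8), math.log(0.4)]
loss = sft_loss(lps, prompt_len=3)
```

Real trainers do the same thing at tensor scale: set the label of every prompt token to an ignore index so it contributes nothing to the gradient.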

Economics from InstructGPT (Huyen):

  • ~13,000 (prompt, response) pairs
  • $10 per pair for expert labelers (90% college-educated)
  • $130,000 just in labeling
  • Plus designing tasks, QA, etc.

This is expensive. Alternatives:

  • LAION volunteer approach (free but biased)
  • DeepMind filtered dialogues from internet data
  • Synthetic data: generate (prompt, response) pairs with a stronger model (now dominant in 2026)

Step 2: Preference Fine-Tuning

After SFT, the model answers, but answers can still be unhelpful, verbose, harmful, or wrong. Preference fine-tuning shapes HOW the model answers.

The question: should the model write an essay claiming one race is inferior? Everyone agrees: no. Should it discuss gun control, abortion, Israel-Palestine? Opinions diverge wildly. Preference fine-tuning is "embedding universal human preference into AI", which Huyen notes is "ambitious, if not impossible".

Two dominant algorithms:

RLHF (original, OpenAI 2022)

Reinforcement Learning from Human Feedback. Two parts:

Part A: train a reward model.

  • Labelers compare pairs: given (prompt, response_A, response_B), which is better?
  • Labels are comparison data (prompt, winning_response, losing_response)
  • Reward model learns to score responses
  • Cost per comparison: ~$3.50 (Llama 2 team), vs $25 for writing a response from scratch
  • Inter-labeler agreement: ~73% (OpenAI)
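
The reward model's training objective on this comparison data is a pairwise (Bradley-Terry style) loss: make the winner's score beat the loser's. A minimal scalar sketch:

```python
import math

def reward_pair_loss(score_win, score_lose):
    # Bradley-Terry pairwise loss: the model should assign a higher
    # score to the winning response. Loss = -log sigmoid(r_w - r_l),
    # which goes to 0 as the margin grows and to log(2) at a tie.
    margin = score_win - score_lose
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Training sums this over all labeled (winner, loser) pairs; only the score *difference* matters, which is why ~73% labeler agreement is still a usable signal.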

Part B: fine-tune the SFT model against the reward model.

  • Sample prompts, generate responses with the SFT model
  • Score each response with the reward model
  • Use PPO (Proximal Policy Optimization) to update the model to maximize scores
  • This is the "reinforcement learning" part
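
The score PPO maximizes is usually not the raw reward-model score but that score minus a KL penalty against the SFT model. A scalar sketch of the shaped reward (the beta value is illustrative):

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    # RLHF training signal: reward-model score minus a KL penalty that
    # keeps the policy close to the SFT model. If beta is too low, the
    # policy can drift into degenerate outputs that game the reward model.
    kl = logp_policy - logp_sft
    return rm_score - beta * kl
```

This beta is the knob behind the "KL penalty too low" pitfall listed later in the note.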

DPO (Direct Preference Optimization, 2023)

Rafailov et al. showed you can skip the reward model entirely. DPO directly trains the model to prefer winning responses over losing ones, using comparison data.
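
A minimal sketch of the DPO loss on a single comparison (the inputs are summed response log-probabilities, shown here as toy scalars; this is not TRL's implementation):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # pi_* are the policy's log-probs of the winning/losing responses,
    # ref_* the frozen reference (SFT) model's. DPO pushes the policy to
    # raise the winner relative to the reference and lower the loser,
    # with no explicit reward model in the loop.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization (policy equals reference) the loss is log(2); gradients then directly reshape the policy's likelihoods, which is why no RL loop is needed.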

Advantages over RLHF:

  • No separate reward model to train
  • No RL loop (simpler implementation)
  • Less compute
  • Less prone to reward hacking

Meta switched from RLHF (Llama 2) to DPO (Llama 3) to reduce complexity.

The reward model detail

The reward model is usually fine-tuned from the same base model as the LLM being trained. Some teams believe the reward model should be at least as strong as the foundation model to score it reliably. Others say a weaker judge can evaluate a stronger model, since "judging is easier than generation" (which Huyen covers more in Chapter 3 on evaluation).

Best-of-N as a shortcut

Some teams (Stitch Fix, Grab) skip the RL loop entirely: they use the reward model to score N samples at inference time and pick the best. Cheaper, simpler, and often good enough. Related to test-time compute.
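
The whole trick fits in a few lines; `generate` and `score` below are stand-ins for a sampling call and a reward-model call:

```python
def best_of_n(prompt, generate, score, n=8):
    # Sample n candidate responses and return the one the reward model
    # scores highest: inference-time search instead of an RL training loop.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with stand-in functions (length as a dummy "reward").
samples = iter(["meh", "good answer", "ok"])
best = best_of_n("some prompt", lambda p: next(samples), score=len, n=3)
```

The trade: N times the inference cost per query, but no PPO instability and no second training run.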

Relevance today (2026)

Huyen's treatment of SFT and RLHF is solid. Updates since Dec 2024:

1. DPO dominates RLHF for most teams

Huyen featured RLHF because it is more flexible. In 2026, DPO and its variants (IPO, KTO, ORPO, SimPO) are the default for most teams. RLHF stays for frontier labs (OpenAI, Anthropic) who want maximum control.

Variants active in 2026:

  • ORPO (Odds Ratio Preference Optimization, 2024): combines SFT and DPO in one step
  • SimPO (2024): no reference model needed, simpler
  • KTO (Kahneman-Tversky Optimization, 2024): does not need paired data, just good/bad labels

2. Synthetic data is the norm

Labeled (prompt, response) pairs at $10 each is 2022 economics. In 2026:

  • Frontier models generate demonstration data for smaller models (distillation)
  • Self-generated preference data (model generates 2 responses, a stronger model judges)
  • Constitutional AI (Anthropic): model critiques and revises its own outputs against a written constitution

Llama 4, Phi-4, DeepSeek-V3 all lean heavily on synthetic post-training data.

3. Reasoning models need different post-training

OpenAI o1 and o3, Claude Opus 4.x thinking mode, and DeepSeek R1 introduced a new post-training recipe focused on reasoning traces:

  • Huge datasets of (problem, chain-of-thought, answer)
  • Reinforcement learning on verified outputs (math, code)
  • Explicit "thinking" token budget during inference

Traditional SFT + DPO is not enough for these models. Full recipe details are guarded; DeepSeek-R1's paper (2025) is the most open window.
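
The "RL on verified outputs" idea can be sketched as a binary outcome reward. `extract_final_answer` is a hypothetical parser; real pipelines verify with code execution, unit tests, or math checkers:

```python
def extract_final_answer(text):
    # Hypothetical parser: take whatever follows the last "Answer:" marker
    # in the model's chain-of-thought output.
    return text.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion, reference):
    # Binary outcome reward: 1.0 if the final answer matches ground truth,
    # else 0.0. Because the check is automatic, RL can run at a scale no
    # human-preference pipeline could match.
    return 1.0 if extract_final_answer(completion) == reference else 0.0
```

This is also where reward hacking bites: the model is rewarded for the final answer, not the reasoning, so traces must be audited separately.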

4. Alignment is a moving target

Anthropic's 2022 finding (more alignment training can decrease alignment) was an early warning. In 2026:

  • RLHF can introduce subtle biases (sycophancy, over-refusal, verbose outputs)
  • Models can learn to game verifiers (reward hacking)
  • "Helpful, harmless, honest" pull in tension
  • Constitutional AI and debate-based alignment are research frontiers

5. Post-training democratized for fine-tuning

In 2022, only big labs could do full RLHF. In 2026:

  • Axolotl, Unsloth, TRL let a solo engineer fine-tune + DPO on a 4090 GPU
  • LoRA + DPO stacks well. You can fine-tune a 70B model's behavior on consumer hardware
  • HuggingFace TRL has implementations of every preference algorithm

Critical questions

  • Why is RLHF more "flexible" than DPO? What specifically can you do with RLHF that DPO cannot?
  • If synthetic data is good enough, why does anyone still pay labelers?
  • Can a weak reward model train a strong policy? (Theoretical issues vs practical results differ.)
  • Preference fine-tuning "embeds human preference" - whose preference? OpenAI labelers in SF? The answer shapes the model's worldview.
  • Is the ultimate goal to eliminate SFT and RLHF by making pre-training better?

Production pitfalls

  • Over-alignment reduces usefulness. Too much preference fine-tuning can make the model refuse anything remotely risky. Test on real user queries before shipping.
  • Reward hacking. Models learn to maximize the reward metric instead of being actually good. Example: output that "looks confident" scores high, even if wrong.
  • Sycophancy. RLHF models tend to agree with the user. Measure this explicitly.
  • Distribution shift. SFT on one distribution (InstructGPT prompts) does not transfer perfectly to your users (enterprise tickets, Torah study questions, etc.). Collect your own eval data.
  • KL penalty too low. Without the KL regularization (keeps the RLHF model close to the SFT model), the model can drift weirdly.
  • Skipping evals. Every post-training step needs a before/after eval set. Without that, you are guessing.

Alternatives / Comparisons

| Approach | Complexity | When to use | Notes |
| --- | --- | --- | --- |
| SFT only | Low | Narrow tasks, low stakes | Fast, cheap, often sufficient |
| SFT + DPO | Medium | Most production assistants in 2026 | Default for new projects |
| SFT + RLHF | High | Frontier labs, maximum control | Harder to stabilize |
| SFT + Constitutional AI | High | Anthropic-style alignment | Only if you have the infra |
| Best-of-N (reward model only) | Low | When inference cost allows | Skip the RL complexity entirely |
| Reasoning post-training (o1 style) | Very high | Reasoning-heavy tasks | Details guarded |

Mini-lab

labs/dpo-on-your-data/ (to create) - use HuggingFace TRL to DPO-fine-tune a small model (Phi-3-mini or Llama 3.2 3B) on a small preference dataset of your own topic (e.g., Torah study Q&A style).

Steps:

  1. Generate 100 (prompt, response_A, response_B) triples
  2. Run DPO training with TRL
  3. Compare before/after on a held-out eval set
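
Step 1's output has to land in the shape TRL's DPOTrainer consumes. A sketch, assuming TRL's standard prompt/chosen/rejected field convention:

```python
def to_dpo_records(triples):
    # Format raw (prompt, better, worse) triples into the
    # {"prompt", "chosen", "rejected"} records that TRL's preference
    # trainers expect; load them with datasets.Dataset.from_list(records).
    return [
        {"prompt": p, "chosen": better, "rejected": worse}
        for p, better, worse in triples
    ]

# Toy example in the note's own domain.
triples = [("What is Shabbat?", "Shabbat is the weekly day of rest.", "idk")]
records = to_dpo_records(triples)
```

Where the "better" response comes from (a stronger model, a judge model, or you) is the real design decision; the formatting is trivial.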

Stack: Unsloth + TRL + LoRA adapters; fits on a single 24GB GPU.

Further reading

  • "Training language models to follow instructions with human feedback" (Ouyang et al., OpenAI, 2022) - InstructGPT, the RLHF reference paper
  • "Direct Preference Optimization" (Rafailov et al., 2023) - DPO paper
  • "Constitutional AI" (Bai et al., Anthropic, 2022) - alternative alignment
  • "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2024) - open reasoning post-training recipe
  • Chip Huyen, Chapter 2 - the version this notion extends
  • HuggingFace TRL docs - for hands-on implementation
Tags: post-training · sft · rlhf · dpo · alignment · finetuning