Foundations
00·Foundations·updated 2026-04-21

Scaling Laws

TL;DR

Scaling laws describe how model quality improves with three inputs: model size (parameters), dataset size (tokens), and compute (FLOPs). The Chinchilla law (DeepMind 2022) showed you need roughly 20 tokens per parameter for compute-optimal training. In 2026, over-training is the norm and test-time compute is a new scaling dimension Huyen did not cover.

The historical problem

Before 2020, "bigger is better" was the dominant ML belief. But nobody had a formula for how much data you need for a given model size. Teams either:

  • Trained huge models on too little data (wasted compute, undertrained)
  • Trained small models on too much data (wasted compute, diminishing returns)

OpenAI's scaling laws paper (Kaplan et al., 2020) gave the first formalization. DeepMind's Chinchilla paper (2022) corrected it and is still the reference today.

Without scaling laws, a team with a $10M compute budget has no idea what model size to aim for. The laws turn compute budgeting from alchemy into math.

How it works

The three scale levers

Every foundation model has three numbers that describe its scale:

| Lever | Proxy for | Example (GPT-3) |
| --- | --- | --- |
| Parameters (N) | Learning capacity | 175B |
| Training tokens (D) | How much the model learned | 300B |
| Compute (FLOPs) | Training cost | 3.14 x 10^23 FLOPs |

Parameters vs tokens vs compute

These are NOT independent:

  • More parameters need more tokens or they overfit
  • More tokens need more compute to process
  • Compute ties them together: FLOPs ~ 6 x N x D for a dense transformer
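The FLOPs ~ 6 x N x D approximation can be checked in a few lines (a sketch; the factor of 6 is the usual forward-plus-backward estimate for a dense transformer):

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer:
    ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * params * tokens

# GPT-3: 175B parameters, 300B tokens
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e}")  # ~3.15e+23, matching the figure in the table above
```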

Chinchilla scaling law (the reference)

DeepMind trained 400 models from 70M to 16B parameters on 5B to 500B tokens. Finding:

For compute-optimal training, the number of training tokens should be roughly 20 times the number of parameters.

Examples applied:

  • 7B model -> 140B training tokens
  • 70B model -> 1.4T training tokens (Chinchilla itself)
  • 175B model -> 3.5T training tokens

Every time you double the model size, you should double the training data. Keeps the ratio constant.
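Combining the 20:1 rule with FLOPs ~ 6 x N x D pins down both numbers from a compute budget alone, since C = 6 x N x (20N) = 120 x N^2 (a sketch of that algebra):

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Compute-optimal (params, tokens) from C = 6*N*D with D = 20*N,
    i.e. C = 120*N^2  =>  N = sqrt(C / 120)."""
    n = math.sqrt(compute_flops / 120)
    return n, 20 * n

n, d = chinchilla_optimal(1e22)
print(f"params ~{n/1e9:.1f}B, tokens ~{d/1e9:.0f}B")  # ~9.1B params, ~183B tokens
```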

Compute economics

Training time = total FLOPs / (cluster FLOP/s x utilization). Dollar cost = training time x number of GPUs x hourly rate.

Real example from Huyen:

  • GPT-3-175B: 3.14 x 10^23 FLOPs
  • 256 H100 GPUs at 5.2 x 10^18 FLOPs/day each
  • At 70% utilization: ~337 days of training
  • At $2/H100/hour: over $4 million just in compute
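Plugging the numbers above into the formula (a sketch; the per-GPU-day throughput figure is the one from the example, not an H100 spec-sheet value):

```python
def training_days_and_cost(total_flops, n_gpus, flops_per_gpu_day,
                           utilization, usd_per_gpu_hour):
    """Wall-clock days and dollar cost for a training run."""
    days = total_flops / (n_gpus * flops_per_gpu_day * utilization)
    cost = days * 24 * n_gpus * usd_per_gpu_hour
    return days, cost

# GPT-3-scale run on the cluster described above
days, cost = training_days_and_cost(3.14e23, 256, 5.2e18, 0.70, 2.0)
print(f"~{days:.0f} days, ~${cost/1e6:.1f}M")  # ~337 days, ~$4.1M
```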

Inverse scaling (when bigger is worse)

Bigger is not always better. Anthropic (2022) showed that more alignment training made models LESS aligned with human preferences on some axes. The Inverse Scaling Prize (2023) found 11 tasks where bigger models do worse, mostly:

  • Tasks with strong priors (models trust their memorized priors over new input)
  • Memorization tasks

But these are edge cases. Bigger is better ~99% of the time.

Relevance today (2026)

Huyen's Chinchilla explanation is the textbook baseline. Reality in 2026 has moved:

1. Over-training is now the norm

Chinchilla is compute-optimal for training. But in production, inference cost dominates total lifecycle cost. So teams over-train small models to reduce inference cost later.

Examples (2024-2026):

  • Llama 3 8B: trained on 15T tokens. Chinchilla ratio would be 160B. That is 94x over-trained.
  • Phi-4 14B: trained on ~10T tokens of curated + synthetic data. ~36x over-trained.
  • DeepSeek-V3 37B active / 671B total: trained on 14.8T tokens with MoE architecture.
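The over-training factor is just tokens delivered divided by the Chinchilla budget (a quick sanity check of the dense-model examples above; MoE models like DeepSeek-V3 need the caveats discussed later):

```python
def overtrain_factor(params: float, tokens: float,
                     chinchilla_ratio: float = 20) -> float:
    """How far past the Chinchilla-optimal token budget a model went."""
    return tokens / (chinchilla_ratio * params)

for name, n, d in [("Llama 3 8B", 8e9, 15e12),
                   ("Phi-4 14B", 14e9, 10e12)]:
    print(f"{name}: {overtrain_factor(n, d):.0f}x over-trained")
# Llama 3 8B: 94x, Phi-4 14B: 36x
```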

Why? A small over-trained model:

  • Is cheaper to serve (less KV cache, fewer FLOPs per token)
  • Fits on consumer GPUs
  • Is easier to fine-tune

Sardana et al. (2023) extended Chinchilla to account for inference demand. Their formulation yields smaller optimal model sizes the more you expect to run the model in production.
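A rough way to see why inference demand changes the optimum (a sketch, not Sardana et al.'s actual formulation; assumes ~2 FLOPs per parameter per generated token for a dense model, and the serving volume is an illustrative number):

```python
def lifecycle_flops(params, train_tokens, inference_tokens):
    """Total lifecycle FLOPs: training (~6*N*D) plus
    inference (~2*N per token generated)."""
    train = 6 * params * train_tokens
    infer = 2 * params * inference_tokens
    return train, infer

# 70B Chinchilla-optimal model serving 10B tokens/day for 2 years
train, infer = lifecycle_flops(70e9, 1.4e12, 10e9 * 365 * 2)
print(f"train {train:.1e} FLOPs, inference {infer:.1e} FLOPs")
# Inference exceeds training cost here, so a smaller over-trained
# model shrinks the dominant term.
```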

2. Test-time compute is a new scaling axis

In 2024-2025, reasoning models (OpenAI o1, o3, Claude Opus 4.x thinking mode, DeepSeek R1, Gemini 2.5 Thinking) introduced a fourth lever: compute at inference time.

DeepMind's "Scaling Test-Time Compute" (Snell et al., 2024) argued that allocating more compute at inference can match the quality gains from scaling model size, sometimes more efficiently.

Concretely: a 100M-param model with a good verifier sampled 400 times can match a 3B-param model with one sample (OpenAI verifier paper, 2021).
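The best-of-N idea behind that result can be sketched with stand-in generator and verifier functions (both hypothetical toys; real systems use a trained verifier or reward model to score samples):

```python
import random

def best_of_n(generate, verify, n: int):
    """Sample n candidate answers, return the one the verifier
    scores highest -- trading inference compute for quality."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: a noisy "model" guessing 7*8, and a verifier that
# scores answers by closeness to the true value it can check.
random.seed(0)
generate = lambda: random.randint(40, 70)
verify = lambda ans: -abs(ans - 56)
print(best_of_n(generate, verify, 400))  # with 400 samples, almost surely 56
```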

Huyen mentions test-time compute but the full shift happened AFTER the book was published. See test time compute.

3. Data quality beats quantity

Huyen briefly notes data quality matters. In 2026, it matters MORE than quantity past a certain point.

  • Phi series (Microsoft): beat much larger models using heavily curated "textbooks are all you need" data
  • Synthetic data for rare skills (math, code, reasoning) via teacher-student (bigger model generates training data for smaller model)
  • RedPajama-v2 has 30T tokens, but most are low-quality web scrape. High-quality subsets are much smaller

Rule: 1T carefully curated tokens often beats 10T raw web tokens.

4. MoE breaks simple Chinchilla math

Mixture-of-Experts models (Mixtral, DeepSeek-V3, Llama 4 MoE variants) have two parameter counts:

  • Active parameters: used per token (matters for inference speed)
  • Total parameters: model capacity (matters for quality)

Chinchilla was derived for dense models. MoE scaling laws are active research (DeepMind, Meta papers 2024-2025). Rough current guidance: scale total params like dense, but inference economics follow active params.
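A minimal illustration of the two counts, using the DeepSeek-V3 numbers from above (the ~2 FLOPs per active parameter per token is the usual inference approximation, not an MoE-specific law):

```python
def inference_flops_per_token(active_params: float) -> float:
    """Per-token inference cost tracks ACTIVE params (~2 FLOPs each);
    memory to hold the weights tracks TOTAL params."""
    return 2 * active_params

# DeepSeek-V3: 671B total (capacity/memory), 37B active (speed/cost)
total, active = 671e9, 37e9
print(f"per-token FLOPs ~{inference_flops_per_token(active):.1e}")
print(f"vs a dense 671B model: {active / total:.1%} of the FLOPs")
```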

5. Cost per quality is still dropping fast

Stanford HAI 2022 report: cost to reach 93% ImageNet accuracy halved 2019-2021. Trend continues:

  • GPT-4-level inference cost dropped 10-100x from 2023 to 2026
  • Training a GPT-3-class model in 2026 costs ~$100K-$500K, not $4M

Don't plan budgets on 2024 numbers.

Critical questions

  • Chinchilla says 20 tokens per parameter. Llama 3 uses 1800+. Are both "compute-optimal" or is one wrong?
  • If test-time compute works, why not train small models and pour compute into inference? (Answer: works until it doesn't, verifier quality caps gains.)
  • MoE total vs active parameters: should your KV cache budget be sized to which?
  • How do scaling laws apply to fine-tuning? (Short answer: they mostly do not, fine-tuning has its own dynamics.)
  • Inverse scaling is real but edge-case. What does it imply for alignment? (Debate: more alignment training can cause weirder biases.)

Production pitfalls

  • Budgeting on pure Chinchilla. Production models care about inference cost, not training cost. Use Sardana et al. or similar inference-aware formulas.
  • Ignoring data quality. Throwing 10T raw web tokens at a model is often WORSE than 1T curated tokens. Data team matters more than GPU budget past a point.
  • Confusing FLOPs with FLOP/s. FLOPs = total floating point operations (training cost). FLOP/s = throughput (hardware speed). Huyen's book catches this, many blogs do not.
  • Utilization assumptions too optimistic. 70% utilization is great, not normal. Plan for 40-50% on a fresh team.
  • Ignoring MoE subtleties. Mixtral 8x7B has 47B params but runs like a 13B model. Pricing and memory planning differ.
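On the KV-cache question, a back-of-envelope sizing helper helps: the cache depends on the attention configuration (layers, KV heads, head dim), not on expert parameters. The config below is illustrative for an 8B-class GQA model, not taken from any specific model card:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch, bytes_per_value=2):
    """KV cache size: 2 (K and V) * layers * KV heads * head dim
    * sequence length * batch, at fp16/bf16 (2 bytes per value)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_value)

# Illustrative config: 32 layers, 8 KV heads (GQA), head_dim 128
gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16) / 2**30
print(f"~{gib:.1f} GiB")  # ~16.0 GiB
```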

Alternatives / Comparisons

| Scaling strategy | When it wins | When it loses |
| --- | --- | --- |
| Compute-optimal (Chinchilla) | Training a frontier model once | Small team with heavy inference load |
| Over-training small models | Inference-heavy production, open-source distribution | Research frontier models |
| Scaling test-time compute | Hard reasoning tasks with verifiers | Latency-sensitive apps |
| Scaling data quality | Specialized domains, when compute is scarce | Pure frontier chase |
| MoE scaling | Large models at manageable inference cost | Simple tasks (overhead wasted) |

Mini-lab

labs/scaling-laws-calculator/ (to create) - build a small calculator that, given a compute budget in FLOPs, outputs:

  • Chinchilla-optimal parameter count and token count
  • Estimated training time for different GPU clusters
  • Estimated inference cost for N requests per day
  • Over-trained small model alternative

Suggested input: uv run calculator.py --compute 1e22 --gpus 64 --gpu-type H100

Goal: feel the relationship between compute budget and design choices.

Further reading

  • "Training Compute-Optimal Large Language Models" (Hoffmann et al., DeepMind, 2022) - the Chinchilla paper
  • "Scaling Laws for Neural Language Models" (Kaplan et al., OpenAI, 2020) - the original, later corrected
  • "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws" (Sardana et al., 2023)
  • "Scaling Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Snell et al., DeepMind, 2024)
  • Llama 3 technical report (Meta, 2024) - over-training in practice
  • Huyen, Chapter 2 - the version this notion extends
Tags: scaling-laws, chinchilla, compute, training, parameters