LLMs
01·LLMs·updated 2026-04-21

Test-Time Compute

TL;DR

Test-time compute is about spending more inference budget per query to improve quality. Instead of one shot, generate multiple candidates and pick the best. It unlocked a new era of **reasoning models** (o1, o3, Claude Opus 4.x thinking mode, DeepSeek R1). DeepMind showed this can sometimes beat scaling model size.

The historical problem

Pre-2024, LLM inference was one-shot: prompt in, one response out, minimal compute per token.

Problem: for hard tasks (math, code, multi-step reasoning), a single pass often fails. A 7B model at one shot cannot solve an AIME problem. But what if you let it try 400 times and pick the best?

Training a bigger model is expensive. What if inference compute could buy the same gains?

This question drove a major shift in 2024-2025.

How it works

The basic idea

For each query:

  1. Generate N candidate responses (with different sampling or prompts)
  2. Select the best one

Selection methods:

  • Highest probability: pick the response with the highest sequence probability (sum of token logprobs). The legacy OpenAI Completions API's best_of parameter works this way; n>1 merely returns all candidates.
  • Reward model scoring: use a trained judge to pick the best
  • Verifier: for verifiable tasks (math, code), check which outputs pass tests
  • Majority vote: for tasks with exact answers, pick the most common output (self-consistency)
  • Heuristic: length, format, domain-specific rules
  • Show to user: let the human pick
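The loop above can be sketched in a few lines. Here `sample` is a stand-in for any LLM call that returns a candidate and its total logprob; swap in a real API call in practice.

```python
import collections
from typing import Callable

def best_of_n(sample: Callable[[], tuple[str, float]], n: int,
              strategy: str = "logprob") -> str:
    """Generate n candidates and select one.

    `sample` is a stub for one LLM call: it returns (text, total_logprob).
    """
    candidates = [sample() for _ in range(n)]
    if strategy == "logprob":
        # Highest sequence probability: max sum of token logprobs.
        return max(candidates, key=lambda c: c[1])[0]
    if strategy == "majority":
        # Self-consistency flavor: most common output wins.
        counts = collections.Counter(text for text, _ in candidates)
        return counts.most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy sampler cycling through canned outputs instead of a real model.
outputs = iter([("42", -1.2), ("41", -3.5), ("42", -2.0)])
pick = best_of_n(lambda: next(outputs), n=3, strategy="majority")
```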

Cost

Generating 2 outputs costs roughly 2x. 400 outputs costs 400x. This is why test-time compute is not free - you are trading dollars for quality.

Strategies beyond naive best-of-N

  • Beam search: at each generation step, keep only the top B partial sequences. Often cheaper than independent best-of-N for the same quality.
  • Self-consistency (Wang et al., 2022): sample N chain-of-thought responses, pick the most common final answer.
  • Tree of Thoughts (Yao et al., 2023): explore a tree of partial reasoning, backtrack from dead ends.
  • Process reward models (PRMs): score not just the final answer, but each reasoning step. Used by OpenAI and DeepMind.
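A minimal self-consistency sketch, assuming the final answer is the last number in each chain; that simplification works for GSM8K-style problems but a real harness needs a proper answer extractor.

```python
import collections
import re

def self_consistency(chains: list[str]) -> str:
    """Self-consistency (Wang et al., 2022): sample N chain-of-thought
    responses, extract each final answer, return the most common one.
    The 'final answer' here is taken to be the last number in the chain."""
    answers = []
    for chain in chains:
        nums = re.findall(r"-?\d+(?:\.\d+)?", chain)
        if nums:
            answers.append(nums[-1])
    return collections.Counter(answers).most_common(1)[0][0]

chains = [
    "3 apples + 4 apples = 7. The answer is 7.",
    "First 3+4... wait, 3*4=12? No, addition: 7.",
    "Total = 3 + 4 = 8.",  # a wrong chain gets outvoted
]
answer = self_consistency(chains)
```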

The reasoning model revolution (2024-2026)

OpenAI o1 (Sep 2024) was the first frontier model explicitly trained to use test-time compute. Instead of responding immediately, it first generates "thinking" tokens (internal reasoning the user does not see), then answers.

Follow-ups:

  • OpenAI o3 (late 2024, 2025)
  • Claude Opus 4.x "thinking mode" (2025-2026)
  • DeepSeek R1 (Jan 2025) - open-weight, recipe public
  • Google Gemini 2.5 Thinking (2025)

These models have a thinking budget: a max number of internal reasoning tokens. More thinking, better answers, higher latency and cost.
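As a hedged sketch of what the budget knob looks like in practice: this mirrors the shape of Anthropic's extended-thinking request as of 2025, but the model id below is hypothetical and exact parameter names vary by provider.

```python
def thinking_request(prompt: str, budget_tokens: int) -> dict:
    """Build a request body with an explicit thinking budget.
    More budget -> better answers on hard tasks, higher latency and cost."""
    return {
        "model": "claude-opus-4-example",  # hypothetical model id
        # max_tokens must exceed the thinking budget so the visible
        # answer still has room after internal reasoning.
        "max_tokens": budget_tokens + 1024,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = thinking_request("Prove there are infinitely many primes.", 8000)
```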

Concrete effect:

  • GPT-4o (no reasoning): ~13% on AIME 2024 math
  • o1-preview: ~56%
  • o1: ~74%
  • o3 (full): ~96%

That jump was not from scaling parameters. It came from scaling inference compute, unlocked by RL post-training on verifiable tasks.

DeepMind's claim

Snell et al. (2024, DeepMind) argued that optimally allocating test-time compute can match the quality gains from scaling model parameters, sometimes more efficiently: under compute-optimal allocation, a small model with a good verifier and enough samples can match a much larger model served single-shot.

Tradeoff: latency. Test-time compute = more latency per response.

OpenAI's interesting finding (Cobbe et al., 2021)

Verifiers (models that score other models' outputs) can boost performance by an amount equivalent to a 30x model size increase. A 100M-param model with a verifier matches a 3B-param model without. This old finding foreshadowed reasoning models.

Relevance today (2026)

Huyen's coverage of test-time compute is brief (the concept existed pre-2024 but the revolution happened in Dec 2024 onwards). In 2026:

1. Reasoning models are a separate product category

Not just a feature, a whole model family:

  • Slower per response (seconds to minutes instead of near-instant)
  • More expensive per token (by 3-10x in API pricing)
  • Qualitatively better on math, code, logic, planning
  • Worse for chat, creative writing, empathy

You pick a reasoning model when the task needs accuracy. You pick a normal model when latency or empathy matters.

2. The categories are starting to merge

By 2026, hybrid models appeared:

  • Claude Opus 4.5: one model, but with a "thinking mode" toggle
  • GPT-5: auto-routes between fast and thinking based on task difficulty
  • DeepSeek V3 + R1: V3 for chat, R1 for reasoning, same base model

Expect this convergence to continue. The distinction may fade by 2027.

3. The cost ceiling is steep

Sampling 400 times was interesting in research (OpenAI 2021). Nobody does it in production at scale. Practical budgets:

  • Simple tasks: 1 sample
  • Hard tasks: 4-16 samples
  • Critical math/code: 32-64 samples, with verifier
  • Research experiments: up to 10,000 samples (Stanford's "Large Language Monkeys", Brown et al. 2024, showed log-linear improvement up to 10K)
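To put numbers on sample budgets, the standard tool is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): given n samples of which c are correct, the probability that at least one of k random draws is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), the chance that a random subset of k
    samples out of n contains at least one of the c correct ones."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 100 samples and 10 correct, more draws buy more coverage:
curve = {k: round(pass_at_k(100, 10, k), 3) for k in (1, 4, 16, 64)}
```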

4. Inverse scaling at high N

Cobbe et al. (2021) found performance decreases past roughly 400 samples. Hypothesis: more samples = more chances the verifier is fooled by adversarial candidates. Brown et al. (2024) saw continued gains instead. The truth depends on verifier quality: coverage (pass@k) keeps rising, but a weak selector cannot harvest it.

5. Application-specific selection is often best

For production:

  • Code generation: run tests, pick what passes (concrete verifier)
  • SQL generation: try executing, pick valid
  • Math: pick most common answer (self-consistency)
  • Chat: usually just 1 sample unless quality matters
  • Extract structured data from image: sample 3x, pick most consistent extraction

Huyen mentions Kittipat Kampa at TIFIN: they parallel-sample, return the first valid response to minimize perceived latency.
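That pattern can be sketched with a thread pool: fire all samplers at once, return the first valid result, cancel the rest. The stub samplers below stand in for real parallel LLM calls.

```python
import concurrent.futures
import time

def first_valid(sample_fns, is_valid, timeout=30.0):
    """Fire all samplers in parallel and return the first response that
    passes validation. Perceived latency becomes the fastest valid
    sample, not the slowest of N."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in sample_fns]
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            result = fut.result()
            if is_valid(result):
                for other in futures:
                    other.cancel()  # best-effort cancel of stragglers
                return result
    return None  # no sample passed validation

# Stub samplers standing in for parallel LLM calls.
def slow_good():
    time.sleep(0.05)
    return "valid answer"

def fast_bad():
    return "???"

out = first_valid([fast_bad, slow_good], is_valid=lambda r: r != "???")
```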

6. Reward model as a service

Anthropic's Constitutional AI trains with model-based judges, and similar reward models are increasingly used at inference. You do not need to train your own - fine-tuned judges are becoming available as APIs (e.g., Cohere Rerank, hosted LLM-as-judge endpoints).

Critical questions

  • If test-time compute can match parameter scaling, why do labs still spend billions training bigger models?
  • For a production chat app, what's the crossover point where adding a 2nd sample is worth the latency hit?
  • How does self-consistency (majority vote) fail? When do all N samples agree but are all wrong?
  • Reasoning models hide thinking tokens. Is this a product decision or a technical limitation? Could you see them if you wanted?
  • Can test-time compute help smaller open-weight models compete with frontier? (Partially yes, still gaps.)

Production pitfalls

  • Unbounded sampling. Setting N=10 seems safe, but under load it is 10x your cost at peak traffic. Plan capacity.
  • Weak verifier. If your selector is bad, sampling more just picks more bad outputs. Often the verifier is the bottleneck.
  • Latency surprise. Users expect chatbots to respond in 1-3 seconds. Reasoning models take 10-60 seconds. Surface this in UI (progress indicator).
  • Fuzzy verifiers for non-verifiable tasks. Code has tests. Math has answers. "Is this essay good?" has no ground truth. Test-time compute works less well here.
  • Context window of N samples in memory. Generating N candidates in parallel consumes N x the KV cache. See kv cache.
  • Hiding thinking tokens from billing. Some providers charge for thinking tokens, some do not. Read the pricing carefully.
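A back-of-envelope for the KV-cache bullet above (the layer/head counts below are assumptions, roughly a 7B-class model in fp16, not any specific architecture):

```python
def kv_cache_bytes(n_samples: int, layers: int, kv_heads: int,
                   head_dim: int, seq_len: int, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint of generating N candidates in parallel:
    2 (K and V) * layers * kv_heads * head_dim * seq_len per sample."""
    per_sample = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes
    return n_samples * per_sample

# Assumed 7B-class config: 32 layers, 8 KV heads, head_dim 128, fp16.
# 16 parallel samples at a 4K context:
gb = kv_cache_bytes(16, 32, 8, 128, seq_len=4096) / 2**30
```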

Alternatives / Comparisons

| Approach | When to use | Cost multiplier |
| --- | --- | --- |
| Single-shot | Default for chat, fast tasks | 1x |
| Best-of-N with reward model | Medium-quality bar, have a verifier | Nx |
| Beam search | Structured outputs, translation | 2-5x |
| Self-consistency | Tasks with exact answers (math, multiple choice) | 5-32x |
| Tree of Thoughts | Complex reasoning, planning | 5-20x |
| Reasoning model (o1/R1 style) | Hard reasoning, verified tasks | 3-10x cost + 10x latency |
| Process reward model | Research, frontier labs | Very high |

Mini-lab

labs/test-time-compute/ (to create) - measure the compute-vs-quality curve on a simple benchmark (MATH, GSM8K, HumanEval) for a small local model:

  1. Run N = 1, 4, 16, 64 samples
  2. Use majority vote (for math) or pass@k (for code)
  3. Plot accuracy vs compute cost
  4. Find YOUR crossover point

Stack: uv + Ollama + a Python benchmark harness. Goal: feel the economics.
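A sketch of the harness, with a seeded stub model in place of the Ollama call so the shape of the curve is visible without a GPU; `solve(question, rng)` is where a real model call would go.

```python
import collections
import random

def run_sweep(solve, dataset, ns=(1, 4, 16, 64), seed=0):
    """Mini-lab harness sketch: for each sample budget N, answer every
    problem by majority vote over N samples and record accuracy."""
    rng = random.Random(seed)
    results = {}
    for n in ns:
        correct = 0
        for question, answer in dataset:
            votes = collections.Counter(solve(question, rng) for _ in range(n))
            if votes.most_common(1)[0][0] == answer:
                correct += 1
        results[n] = correct / len(dataset)
    return results

# Stub model: right 40% of the time, wrong answers spread thin, so
# majority vote recovers accuracy as N grows.
def noisy_solver(question, rng):
    return "7" if rng.random() < 0.4 else rng.choice(["5", "6", "8", "9"])

curve = run_sweep(noisy_solver, [("3+4?", "7")] * 50)
```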

Further reading

  • "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Snell et al., DeepMind, 2024)
  • "Training Verifiers to Solve Math Word Problems" (Cobbe et al., OpenAI, 2021) - the foundational verifier paper
  • "Self-Consistency Improves Chain of Thought Reasoning" (Wang et al., Google, 2022)
  • "Tree of Thoughts" (Yao et al., 2023)
  • "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (Brown et al., Stanford, 2024) - the 10,000-sample experiment
  • DeepSeek R1 paper (2025) - open recipe for reasoning post-training
  • OpenAI o1 system card (2024)
test-time-compute · reasoning · inference · best-of-n · beam-search · o1 · r1