LLMs
01·LLMs·updated 2026-04-21

Test-Time Compute

TL;DR

Test-time compute is about spending more inference budget per query to improve quality. Instead of one shot, generate multiple candidates and pick the best. It unlocked a new era of **reasoning models** (o1, o3, Claude Opus 4.x thinking mode, DeepSeek R1). DeepMind showed this can sometimes beat scaling model size.

The historical problem

Pre-2024, LLM inference was one-shot: prompt in, one response out, minimal compute per token.

Problem: for hard tasks (math, code, multi-step reasoning), a single pass often fails. A 7B model at one shot cannot solve an AIME problem. But what if you let it try 400 times and pick the best?

Training a bigger model is expensive. What if inference compute could buy the same gains?

This question drove a major shift in 2024-2025.

How it works

The basic idea

For each query:

  1. Generate N candidate responses (with different sampling or prompts)
  2. Select the best one

Selection methods:

  • Highest probability: pick the response with the highest sequence probability (sum of token logprobs). The legacy OpenAI Completions API's best_of parameter works this way; n>1 merely returns all candidates.
  • Reward model scoring: use a trained judge to pick the best
  • Verifier: for verifiable tasks (math, code), check which outputs pass tests
  • Majority vote: for tasks with exact answers, pick the most common output (self-consistency)
  • Heuristic: length, format, domain-specific rules
  • Show to user: let the human pick
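The loop above can be sketched in a few lines. Here `sample` is a stand-in for any LLM call that returns a candidate and its total logprob; swap in a real API call in practice.

```python
import collections
from typing import Callable

def best_of_n(sample: Callable[[], tuple[str, float]], n: int,
              strategy: str = "logprob") -> str:
    """Generate n candidates and select one.

    `sample` is a stub for one LLM call: it returns (text, total_logprob).
    """
    candidates = [sample() for _ in range(n)]
    if strategy == "logprob":
        # Highest sequence probability: max sum of token logprobs.
        return max(candidates, key=lambda c: c[1])[0]
    if strategy == "majority":
        # Self-consistency flavor: most common output wins.
        counts = collections.Counter(text for text, _ in candidates)
        return counts.most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy sampler cycling through canned outputs instead of a real model.
outputs = iter([("42", -1.2), ("41", -3.5), ("42", -2.0)])
pick = best_of_n(lambda: next(outputs), n=3, strategy="majority")
```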

Cost

Generating 2 outputs costs roughly 2x. 400 outputs costs 400x. This is why test-time compute is not free - you are trading dollars for quality.

Strategies beyond naive best-of-N

  • Beam search: at each generation step, keep only the top B partial sequences. Often cheaper than independent best-of-N for the same quality.
  • Self-consistency (Wang et al., 2022): sample N chain-of-thought responses, pick the most common final answer.
  • Tree of Thoughts (Yao et al., 2023): explore a tree of partial reasoning, backtrack from dead ends.
  • Process reward models (PRMs): score not just the final answer, but each reasoning step. Used by OpenAI and DeepMind.
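A minimal self-consistency sketch, assuming the final answer is the last number in each chain; that simplification works for GSM8K-style problems but a real harness needs a proper answer extractor.

```python
import collections
import re

def self_consistency(chains: list[str]) -> str:
    """Self-consistency (Wang et al., 2022): sample N chain-of-thought
    responses, extract each final answer, return the most common one.
    The 'final answer' here is taken to be the last number in the chain."""
    answers = []
    for chain in chains:
        nums = re.findall(r"-?\d+(?:\.\d+)?", chain)
        if nums:
            answers.append(nums[-1])
    return collections.Counter(answers).most_common(1)[0][0]

chains = [
    "3 apples + 4 apples = 7. The answer is 7.",
    "First 3+4... wait, 3*4=12? No, addition: 7.",
    "Total = 3 + 4 = 8.",  # a wrong chain gets outvoted
]
answer = self_consistency(chains)
```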

The reasoning model revolution (2024-2026)

OpenAI o1 (Sep 2024) was the first frontier model explicitly trained to use test-time compute. Instead of responding immediately, it first generates "thinking" tokens (internal reasoning the user does not see), then answers.

Follow-ups:

  • OpenAI o3 (late 2024, 2025)
  • Claude Opus 4.x "thinking mode" (2025-2026)
  • DeepSeek R1 (Jan 2025) - open-weight, recipe public
  • Google Gemini 2.5 Thinking (2025)

These models have a thinking budget: a max number of internal reasoning tokens. More thinking, better answers, higher latency and cost.
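As a hedged sketch of what the budget knob looks like in practice: this mirrors the shape of Anthropic's extended-thinking request as of 2025, but the model id below is hypothetical and exact parameter names vary by provider.

```python
def thinking_request(prompt: str, budget_tokens: int) -> dict:
    """Build a request body with an explicit thinking budget.
    More budget -> better answers on hard tasks, higher latency and cost."""
    return {
        "model": "claude-opus-4-example",  # hypothetical model id
        # max_tokens must exceed the thinking budget so the visible
        # answer still has room after internal reasoning.
        "max_tokens": budget_tokens + 1024,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = thinking_request("Prove there are infinitely many primes.", 8000)
```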

Concrete effect:

  • GPT-4o (no reasoning): ~13% on AIME 2024 math
  • o1-preview: ~56%
  • o1: ~74%
  • o3 (full): ~96%

That jump was not from scaling parameters. It came from scaling inference compute, unlocked by RL post-training on verifiable tasks.

DeepMind's claim

Snell et al. (2024, DeepMind) argued that optimally allocating test-time compute can match the quality gains from scaling model parameters, sometimes more efficiently: under compute-optimal allocation, a small model with a good verifier and enough samples can match a much larger model served single-shot.

Tradeoff: latency. Test-time compute = more latency per response.

OpenAI's interesting finding (Cobbe et al., 2021)

Verifiers (models that score other models' outputs) can boost performance by an amount equivalent to a 30x model size increase. A 100M-param model with a verifier matches a 3B-param model without. This old finding foreshadowed reasoning models.

Relevance today (2026)

Huyen's coverage of test-time compute is brief (the concept existed pre-2024 but the revolution happened in Dec 2024 onwards). In 2026:

1. Reasoning models are a separate product category

Not just a feature, a whole model family:

  • Slower per response (seconds to minutes instead of near-instant)
  • More expensive per token (by 3-10x in API pricing)
  • Qualitatively better on math, code, logic, planning
  • Worse for chat, creative writing, empathy

You pick a reasoning model when the task needs accuracy. You pick a normal model when latency or empathy matters.

2. The categories are starting to merge

By 2026, hybrid models appeared:

  • Claude Opus 4.5: one model, but with a "thinking mode" toggle
  • GPT-5: auto-routes between fast and thinking based on task difficulty
  • DeepSeek V3 + R1: V3 for chat, R1 for reasoning, same base model

Expect this convergence to continue. The distinction may fade by 2027.

3. The cost ceiling is steep

Sampling 400 times was interesting in research (OpenAI 2021). Nobody does it in production at scale. Practical budgets:

  • Simple tasks: 1 sample
  • Hard tasks: 4-16 samples
  • Critical math/code: 32-64 samples, with verifier
  • Research experiments: up to 10,000 samples (Stanford's "Large Language Monkeys", Brown et al. 2024, showed log-linear improvement up to 10K)
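To put numbers on sample budgets, the standard tool is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): given n samples of which c are correct, the probability that at least one of k random draws is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), the chance that a random subset of k
    samples out of n contains at least one of the c correct ones."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 100 samples and 10 correct, more draws buy more coverage:
curve = {k: round(pass_at_k(100, 10, k), 3) for k in (1, 4, 16, 64)}
```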

4. Inverse scaling at high N

Cobbe et al. (2021) found performance decreases past roughly 400 samples. Hypothesis: more samples = more chances the verifier is fooled by adversarial candidates. Brown et al. (2024) saw continued gains instead. The truth depends on verifier quality: coverage (pass@k) keeps rising, but a weak selector cannot harvest it.

5. Application-specific selection is often best

For production:

  • Code generation: run tests, pick what passes (concrete verifier)
  • SQL generation: try executing, pick valid
  • Math: pick most common answer (self-consistency)
  • Chat: usually just 1 sample unless quality matters
  • Extract structured data from image: sample 3x, pick most consistent extraction

Huyen mentions Kittipat Kampa at TIFIN: they parallel-sample, return the first valid response to minimize perceived latency.
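That pattern can be sketched with a thread pool: fire all samplers at once, return the first valid result, cancel the rest. The stub samplers below stand in for real parallel LLM calls.

```python
import concurrent.futures
import time

def first_valid(sample_fns, is_valid, timeout=30.0):
    """Fire all samplers in parallel and return the first response that
    passes validation. Perceived latency becomes the fastest valid
    sample, not the slowest of N."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in sample_fns]
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            result = fut.result()
            if is_valid(result):
                for other in futures:
                    other.cancel()  # best-effort cancel of stragglers
                return result
    return None  # no sample passed validation

# Stub samplers standing in for parallel LLM calls.
def slow_good():
    time.sleep(0.05)
    return "valid answer"

def fast_bad():
    return "???"

out = first_valid([fast_bad, slow_good], is_valid=lambda r: r != "???")
```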

6. Reward model as a service

Anthropic's Constitutional AI trains with model-based judges, and similar reward models are increasingly used at inference. You do not need to train your own - fine-tuned judges are becoming available as APIs (e.g., Cohere Rerank, hosted LLM-as-judge endpoints).

Critical questions

  • If test-time compute can match parameter scaling, why do labs still spend billions training bigger models?
  • For a production chat app, what's the crossover point where adding a 2nd sample is worth the latency hit?
  • How does self-consistency (majority vote) fail? When do all N samples agree but are all wrong?
  • Reasoning models hide thinking tokens. Is this a product decision or a technical limitation? Could you see them if you wanted?
  • Can test-time compute help smaller open-weight models compete with frontier? (Partially yes, still gaps.)

Production pitfalls

  • Unbounded sampling. Setting N=10 seems safe, but under load it is 10x your cost at peak traffic. Plan capacity.
  • Weak verifier. If your selector is bad, sampling more just picks more bad outputs. Often the verifier is the bottleneck.
  • Latency surprise. Users expect chatbots to respond in 1-3 seconds. Reasoning models take 10-60 seconds. Surface this in UI (progress indicator).
  • Fuzzy verifiers for non-verifiable tasks. Code has tests. Math has answers. "Is this essay good?" has no ground truth. Test-time compute works less well here.
  • Context window of N samples in memory. Generating N candidates in parallel consumes N x the KV cache. See kv cache.
  • Hiding thinking tokens from billing. Some providers charge for thinking tokens, some do not. Read the pricing carefully.
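A back-of-envelope for the KV-cache bullet above (the layer/head counts below are assumptions, roughly a 7B-class model in fp16, not any specific architecture):

```python
def kv_cache_bytes(n_samples: int, layers: int, kv_heads: int,
                   head_dim: int, seq_len: int, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint of generating N candidates in parallel:
    2 (K and V) * layers * kv_heads * head_dim * seq_len per sample."""
    per_sample = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes
    return n_samples * per_sample

# Assumed 7B-class config: 32 layers, 8 KV heads, head_dim 128, fp16.
# 16 parallel samples at a 4K context:
gb = kv_cache_bytes(16, 32, 8, 128, seq_len=4096) / 2**30
```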

Alternatives / Comparisons

| Approach | When to use | Cost multiplier |
| --- | --- | --- |
| Single-shot | Default for chat, fast tasks | 1x |
| Best-of-N with reward model | Medium-quality bar, have a verifier | Nx |
| Beam search | Structured outputs, translation | 2-5x |
| Self-consistency | Tasks with exact answers (math, multiple choice) | 5-32x |
| Tree of Thoughts | Complex reasoning, planning | 5-20x |
| Reasoning model (o1/R1 style) | Hard reasoning, verified tasks | 3-10x cost + 10x latency |
| Process reward model | Research, frontier labs | Very high |

Mini-lab

labs/test-time-compute/ (to create) - measure the compute-vs-quality curve on a simple benchmark (MATH, GSM8K, HumanEval) for a small local model:

  1. Run N = 1, 4, 16, 64 samples
  2. Use majority vote (for math) or pass@k (for code)
  3. Plot accuracy vs compute cost
  4. Find YOUR crossover point

Stack: uv + Ollama + a Python benchmark harness. Goal: feel the economics.
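A sketch of the harness, with a seeded stub model in place of the Ollama call so the shape of the curve is visible without a GPU; `solve(question, rng)` is where a real model call would go.

```python
import collections
import random

def run_sweep(solve, dataset, ns=(1, 4, 16, 64), seed=0):
    """Mini-lab harness sketch: for each sample budget N, answer every
    problem by majority vote over N samples and record accuracy."""
    rng = random.Random(seed)
    results = {}
    for n in ns:
        correct = 0
        for question, answer in dataset:
            votes = collections.Counter(solve(question, rng) for _ in range(n))
            if votes.most_common(1)[0][0] == answer:
                correct += 1
        results[n] = correct / len(dataset)
    return results

# Stub model: right 40% of the time, wrong answers spread thin, so
# majority vote recovers accuracy as N grows.
def noisy_solver(question, rng):
    return "7" if rng.random() < 0.4 else rng.choice(["5", "6", "8", "9"])

curve = run_sweep(noisy_solver, [("3+4?", "7")] * 50)
```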

Further reading

  • "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Snell et al., DeepMind, 2024)
  • "Training Verifiers to Solve Math Word Problems" (Cobbe et al., OpenAI, 2021) - the foundational verifier paper
  • "Self-Consistency Improves Chain of Thought Reasoning" (Wang et al., Google, 2022)
  • "Tree of Thoughts" (Yao et al., 2023)
  • "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (Brown et al., Stanford, 2024) - the 10,000-sample experiment
  • DeepSeek R1 paper (2025) - open recipe for reasoning post-training
  • OpenAI o1 system card (2024)
test-time-compute · reasoning · inference · best-of-n · beam-search · o1 · r1