# Sampling and Temperature
Sampling is how a model picks the next token from the probability distribution over its vocabulary. The main knobs are **temperature** (flatten or sharpen probabilities), **top-k** (limit to the k most likely tokens), and **top-p** (limit to the smallest set whose cumulative probability exceeds p). These knobs control the creativity vs consistency trade-off.
## The historical problem
A transformer outputs a logit vector: one number per token in the vocabulary (say, 100,000 numbers for GPT-4). Logits are just raw scores, not probabilities: they can be negative and do not sum to 1.
To pick the next token, you need a strategy:
- Greedy: always pick the highest logit. Deterministic but boring. Every answer will use the most common words.
- Sample: pick according to probabilities. More creative but inconsistent.
Without smart sampling, LLMs sound robotic ("My favorite color is blue") or go off the rails (random rare tokens). Sampling strategies shape output quality.
## How it works
### Logits to probabilities: softmax
The softmax function turns logits into a probability distribution:
p_i = exp(x_i) / sum_j(exp(x_j))
For each token in the vocab, compute exp(logit), divide by the sum. Result: probabilities that sum to 1.
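A minimal pure-Python sketch of this step (the max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability; the result is unchanged
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0])
print([round(p, 2) for p in probs])  # [0.27, 0.73], sums to 1
```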
### Temperature
Divide logits by a constant T BEFORE applying softmax:
p_i = softmax(x_i / T)
Effect:
- T = 1: standard softmax
- T > 1: flattens the distribution. Rare tokens get more probability. More creative, sometimes incoherent.
- T < 1: sharpens the distribution. Dominant tokens dominate more. More consistent, sometimes boring.
- T -> 0: practically equivalent to greedy (always pick highest logit)
- T = 0 in APIs: implementations skip sampling entirely and just take the argmax
Concrete example (logits = [1, 2] for tokens A and B):
- T = 1: probabilities are [0.27, 0.73]. B picked 73% of the time.
- T = 0.5: probabilities are [0.12, 0.88]. B picked 88%.
- T = 2: probabilities are [0.38, 0.62], moving toward [0.5, 0.5] as T grows.
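The example above can be reproduced with a small sketch (reusing a hand-rolled softmax):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, T):
    # divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    return softmax([x / T for x in logits])

logits = [1.0, 2.0]  # tokens A and B
for T in (1.0, 0.5, 2.0):
    print(T, [round(p, 2) for p in apply_temperature(logits, T)])
# 1.0 [0.27, 0.73]
# 0.5 [0.12, 0.88]
# 2.0 [0.38, 0.62]
```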
Rule of thumb:
- Factual tasks, classification, structured output: T = 0 or very low
- General assistant: T = 0.7 is a common balance
- Creative writing, brainstorming: T = 0.9-1.2
Most provider APIs cap T at 2.
### Top-k sampling
After computing logits, keep only the top K values, redistribute probabilities over them (softmax over just those K), then sample.
- Small K (10-40): predictable text
- Medium K (50-200): good general default
- Large K (500+): more diverse but approaches full sampling
Computational benefit: softmax over the full 100K vocab is expensive. Top-k reduces it to a manageable set.
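A sketch of the procedure, using an inverse-CDF draw over the surviving tokens (function names here are mine, not a library API):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # keep the k highest logits, softmax over just those, then sample
    top = sorted(enumerate(logits), key=lambda iv: iv[1], reverse=True)[:k]
    m = max(v for _, v in top)
    weights = [(i, math.exp(v - m)) for i, v in top]
    total = sum(w for _, w in weights)
    r = rng.random() * total  # inverse-CDF draw
    for i, w in weights:
        r -= w
        if r <= 0:
            return i
    return weights[-1][0]

# with k=1 this degenerates to greedy decoding
print(top_k_sample([0.1, 5.0, 1.0], k=1))  # 1
```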
### Top-p (nucleus) sampling
More adaptive than top-k. Sort tokens by probability in descending order. Take the smallest set whose cumulative probability exceeds p. Sample from that set.
- p = 0.9 (default in many APIs): includes tokens covering 90% of probability mass
- p = 0.95: slightly broader
- p = 0.5: very restrictive, nearly deterministic
Why it is often preferred over top-k: the set size adapts. For clear questions ("yes or no"), it shrinks to 2 tokens. For open questions, it grows to dozens.
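A sketch of the nucleus selection step, showing how the kept set shrinks when the distribution is confident:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(logits, p):
    # smallest set of tokens, by descending probability, whose cumulative mass reaches p
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

print(nucleus([3.0, 1.0, 0.0, -1.0], p=0.9))  # [0, 1]: two tokens cover 90%
print(nucleus([3.0, 1.0, 0.0, -1.0], p=0.5))  # [0]: one confident token suffices
```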
### Min-p (newer variant)
Keep only tokens whose probability is at least min_p times the probability of the most likely token, so the cutoff scales with the model's confidence. Less used than top-p but gaining traction in open-weight communities.
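A sketch, assuming the common min-p definition where the cutoff scales with the top token's probability:

```python
def min_p_filter(probs, min_p=0.1):
    # keep tokens whose probability is at least min_p * the top probability,
    # so the cutoff adapts to how confident the model is
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

print(min_p_filter([0.70, 0.20, 0.05, 0.05], min_p=0.1))  # [0, 1]: cutoff is 0.07
```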
### Logprobs
Many APIs expose logprobs: the log of probabilities. Used for:
- Confidence signals (how sure was the model?)
- Structured output validation
- Classification by picking the highest-probability class token
Log scale avoids underflow: a 100K-vocab model has probabilities so tiny they round to zero in float. Logs stay well-behaved.
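A quick demonstration of why the log scale matters:

```python
import math

# 400 tokens, each with probability 0.01: the product underflows float64,
# while the sum of logprobs stays a perfectly ordinary number
probs = [0.01] * 400
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflowed to zero

logprob = sum(math.log(p) for p in probs)
print(logprob)  # about -1842.1
```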
### Stopping conditions
You also need to decide WHEN to stop generating:
- Max tokens: hard cutoff (risk: cuts mid-sentence)
- End-of-sequence token: the model emits `<EOS>`, we stop
- Stop strings: stop if the model emits a specific string (e.g., `"\nUser:"` for dialogue turn separation)
- Grammar-bound stopping: for structured outputs, stop when the grammar says the output is complete
## Relevance today (2026)
Huyen's explanation of temperature, top-k, top-p is timeless. What shifted since Dec 2024:
### 1. Temperature 0 is not deterministic (the hard truth)
Teams assumed T=0 meant reproducible output. Not quite:
- Same model, same prompt, T=0 can give different outputs across requests
- Causes: GPU non-determinism (floating point ordering), batched-inference effects, KV cache interaction
- OpenAI and Anthropic do not guarantee bit-exact determinism even at T=0
- To get reproducibility you also need: same hardware, same batch size, same softmax implementation, and ideally a `seed` parameter
Lesson: T=0 gives consistency, not determinism. Test for both.
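The floating-point part is easy to demonstrate: addition is not associative, so a GPU kernel that reduces the same logits in a different order (e.g., because the batch size changed) can produce slightly different numbers:

```python
# floating-point addition is not associative
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 0.0 + 1.0 -> 1.0
right = a + (b + c)   # the 1.0 is absorbed into -1e16 before a cancels it
print(left, right)    # 1.0 0.0
```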
### 2. Logprobs are sometimes hidden
Huyen mentions limited logprobs APIs. In 2026:
- OpenAI exposes top-20 logprobs
- Anthropic exposes none at time of writing
- Open-weight models (vLLM, SGLang) expose everything
- Security reason: logprobs leak model internals, making reverse-engineering easier
If you need logprobs for classification or calibration, check API access first.
### 3. Reasoning models change the game
OpenAI o1, Claude Opus 4.5 thinking, DeepSeek R1 separate "thinking" tokens from "response" tokens. Temperature settings often apply differently to each phase. Some APIs hide thinking tokens entirely.
See test time compute for the bigger picture.
### 4. Structured output changed sampling too
Tools like Outlines, Guidance, xgrammar implement constrained sampling: filter logits to only allow tokens that satisfy a grammar (JSON schema, regex, BNF). This is orthogonal to temperature/top-k/top-p but layered on top.
See structured outputs.
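Underneath, these tools all do a variant of the same thing: mask disallowed logits to -inf before softmax. A minimal sketch (in practice the allowed set would come from a grammar engine, not a hand-written set):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrain(logits, allowed_ids):
    # tokens outside the grammar get -inf, i.e. probability exactly 0 after softmax
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

probs = softmax(constrain([1.0, 2.0, 3.0], allowed_ids={0, 2}))
print([round(p, 2) for p in probs])  # [0.12, 0.0, 0.88]
```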
### 5. New heuristics like DRY and XTC
Open-weight communities experiment with:
- DRY (Don't Repeat Yourself): penalizes recently-repeated phrases
- XTC (Exclude Top Choices): flips the usual logic, excludes the most likely tokens to force creativity
- Mirostat / Typical Sampling: information-theoretic approaches
None are standard. But worth knowing they exist if you run your own models (vLLM, llama.cpp support some).
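As a flavor of how different these are from classic truncation, here is a hedged sketch of XTC; parameter names follow common open-weight sampler conventions, but check your backend's docs:

```python
import random

def xtc_filter(probs, threshold=0.1, activation_prob=0.5, rng=random):
    # with probability activation_prob, remove every token at or above
    # `threshold` EXCEPT the least likely of them, forcing a less obvious pick
    if rng.random() >= activation_prob:
        return list(range(len(probs)))       # not activated this step
    top = [i for i, p in enumerate(probs) if p >= threshold]
    if len(top) < 2:
        return list(range(len(probs)))       # nothing worth excluding
    keep = min(top, key=lambda i: probs[i])  # spare the weakest "top choice"
    removed = set(top) - {keep}
    return [i for i in range(len(probs)) if i not in removed]

# activation_prob=1.0 forces the filter on for the demo
print(xtc_filter([0.5, 0.3, 0.15, 0.05], activation_prob=1.0))  # [2, 3]
```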
## Critical questions
- Why does T=0 not guarantee determinism? Which layer introduces the randomness?
- If top-p is more adaptive, why do many production systems still use top-k?
- What's the difference between lowering temperature to 0.3 vs setting top-p to 0.3? They feel similar but are not the same.
- Why do reasoning models often recommend specific temperatures (DeepSeek R1: 0.6) rather than 0?
- Logprobs are "security sensitive". How sensitive, really? What attack do they enable?
## Production pitfalls
- Assuming determinism at T=0. You can get different outputs for the same prompt. Cache responses if you need exact repeats.
- Expecting diversity at T=1. With a sharp distribution (a confident model), T=1 output still looks greedy. If you want diversity, you may need higher T plus top-p.
- Forgetting stop tokens in structured outputs. JSON generation with no max_tokens and a broken stop condition can burn 100K tokens of garbage.
- Mismatched tokenizers in logprob handling. Computing logprobs on tokens that differ from the model's internal tokenization gives nonsense.
- Temperature 0 for creative tasks. The model will pick the same boring opener every time. Use T=0.7+ for variety.
- Using temperature to "fix" hallucinations. Lower T does NOT reduce hallucinations, it just makes them consistent. See hallucinations.
## Alternatives / Comparisons
| Strategy | Knob | Good for | Avoid for |
|---|---|---|---|
| Greedy (T=0) | None | Classification, structured | Creative text |
| Temperature | T = 0.7-1.0 | General chat | Need consistency |
| Top-k | K = 50-100 | Balanced generation | Very short answers |
| Top-p (nucleus) | p = 0.9-0.95 | Adaptive sampling | Fine-grained control |
| Min-p | p = 0.05-0.1 | Truncate long tail | Main strategy in isolation |
| Constrained sampling | Grammar | Structured output (JSON, SQL) | Free-form text |
| Beam search | Beam width | Translation, exact answers | Chat, diversity |
Stack them: typical production config is temperature=0.7, top_p=0.95, top_k=100 together.
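A sketch of that stacked order (temperature, then top-k, then top-p, then sample), which mirrors how most serving stacks apply the knobs:

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=100, top_p=0.95, rng=random):
    # 1. temperature: rescale logits
    scaled = [x / temperature for x in logits]
    # 2. top-k: keep the k highest logits
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 3. top-p: truncate to the nucleus
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # 4. renormalize over the survivors and draw
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for idx, p in kept:
        r -= p
        if r <= 0:
            return idx
    return kept[-1][0]

# a very confident distribution collapses to greedy regardless of the rng
print(sample_next([0.0, 10.0, 0.0], temperature=0.5))  # 1
```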
## Mini-lab
labs/sampling-playground/ (to create) - run the same prompt through a local model (Phi-3 or Llama 3 via Ollama) with varied parameters. Visualize:
- Output diversity across 10 runs at each setting
- Token probability distributions at different temperatures
- Latency impact of different sampling configs
Stack: uv + transformers + a small plotting lib.
Goal: internalize the "feel" of each knob.
## Further reading
- Huyen, Chapter 2 - the chapter this note extends
- "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019) - original top-p paper
- OpenAI sampling parameters docs: https://platform.openai.com/docs/api-reference/chat
- "Watermarking" papers (Kirchenbauer et al., 2023+) - sampling-based model fingerprinting
- vLLM sampling parameters (for open-weight serving): https://docs.vllm.ai