# Sampling and Temperature
Sampling is how a model picks the next token from the probability distribution over its vocabulary. The main knobs are **temperature** (flatten or sharpen probabilities), **top-k** (limit to the k most likely tokens), and **top-p** (limit to the smallest set whose cumulative probability exceeds p). These knobs control the creativity vs consistency trade-off.
## The historical problem
A transformer outputs a logit vector: one number per token in the vocabulary (say, 100,000 numbers for GPT-4). Logits are just raw scores, not probabilities: they can be negative and do not sum to 1.
To pick the next token, you need a strategy:
- Greedy: always pick the highest logit. Deterministic but boring. Every answer will use the most common words.
- Sample: pick according to probabilities. More creative but inconsistent.
Without smart sampling, LLMs sound robotic ("My favorite color is blue") or go off the rails (random rare tokens). Sampling strategies shape output quality.
## How it works
### Logits to probabilities: softmax
The softmax function turns logits into a probability distribution:
p_i = exp(x_i) / sum_j(exp(x_j))
For each token in the vocab, compute exp(logit), divide by the sum. Result: probabilities that sum to 1.
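A minimal pure-Python sketch of this step (the max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability; the result is unchanged
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0])
print([round(p, 2) for p in probs])  # [0.27, 0.73], sums to 1
```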
### Temperature
Divide logits by a constant T BEFORE applying softmax:
p_i = softmax(x_i / T)
Effect:
- T = 1: standard softmax
- T > 1: flattens the distribution. Rare tokens get more probability. More creative, sometimes incoherent.
- T < 1: sharpens the distribution. Dominant tokens dominate more. More consistent, sometimes boring.
- T -> 0: practically equivalent to greedy (always pick highest logit)
- T = 0 in APIs: implementations skip sampling entirely and just take the argmax
Concrete example (logits = [1, 2] for tokens A and B):
- T = 1: probabilities are [0.27, 0.73]. B picked 73% of the time.
- T = 0.5: probabilities are [0.12, 0.88]. B picked 88%.
- T = 2: probabilities are [0.38, 0.62], moving toward [0.5, 0.5] as T grows.
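The example above can be reproduced with a small sketch (reusing a hand-rolled softmax):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, T):
    # divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    return softmax([x / T for x in logits])

logits = [1.0, 2.0]  # tokens A and B
for T in (1.0, 0.5, 2.0):
    print(T, [round(p, 2) for p in apply_temperature(logits, T)])
# 1.0 [0.27, 0.73]
# 0.5 [0.12, 0.88]
# 2.0 [0.38, 0.62]
```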
Rule of thumb:
- Factual tasks, classification, structured output: T = 0 or very low
- General assistant: T = 0.7 is a common balance
- Creative writing, brainstorming: T = 0.9-1.2
Most provider APIs cap T at 2.
### Top-k sampling
After computing logits, keep only the top K values, redistribute probabilities over them (softmax over just those K), then sample.
- Small K (10-40): predictable text
- Medium K (50-200): good general default
- Large K (500+): more diverse but approaches full sampling
Computational benefit: softmax over the full 100K vocab is expensive. Top-k reduces it to a manageable set.
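A sketch of the procedure, using an inverse-CDF draw over the surviving tokens (function names here are mine, not a library API):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # keep the k highest logits, softmax over just those, then sample
    top = sorted(enumerate(logits), key=lambda iv: iv[1], reverse=True)[:k]
    m = max(v for _, v in top)
    weights = [(i, math.exp(v - m)) for i, v in top]
    total = sum(w for _, w in weights)
    r = rng.random() * total  # inverse-CDF draw
    for i, w in weights:
        r -= w
        if r <= 0:
            return i
    return weights[-1][0]

# with k=1 this degenerates to greedy decoding
print(top_k_sample([0.1, 5.0, 1.0], k=1))  # 1
```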
### Top-p (nucleus) sampling
More adaptive than top-k. Sort tokens by probability in descending order. Take the smallest set whose cumulative probability exceeds p. Sample from that set.
- p = 0.9 (default in many APIs): includes tokens covering 90% of probability mass
- p = 0.95: slightly broader
- p = 0.5: very restrictive, nearly deterministic
Why it is often preferred over top-k: the set size adapts. For clear questions ("yes or no"), it shrinks to 2 tokens. For open questions, it grows to dozens.
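A sketch of the nucleus selection step, showing how the kept set shrinks when the distribution is confident:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(logits, p):
    # smallest set of tokens, by descending probability, whose cumulative mass reaches p
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

print(nucleus([3.0, 1.0, 0.0, -1.0], p=0.9))  # [0, 1]: two tokens cover 90%
print(nucleus([3.0, 1.0, 0.0, -1.0], p=0.5))  # [0]: one confident token suffices
```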
### Min-p (newer variant)
Keep only tokens whose probability is at least min_p times the probability of the most likely token, so the cutoff scales with the model's confidence. Less used than top-p but gaining traction in open-weight communities.
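A sketch, assuming the common min-p definition where the cutoff scales with the top token's probability:

```python
def min_p_filter(probs, min_p=0.1):
    # keep tokens whose probability is at least min_p * the top probability,
    # so the cutoff adapts to how confident the model is
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

print(min_p_filter([0.70, 0.20, 0.05, 0.05], min_p=0.1))  # [0, 1]: cutoff is 0.07
```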
### Logprobs
Many APIs expose logprobs: the log of probabilities. Used for:
- Confidence signals (how sure was the model?)
- Structured output validation
- Classification by picking the highest-probability class token
Log scale avoids underflow: a 100K-vocab model has probabilities so tiny they round to zero in float. Logs stay well-behaved.
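A quick demonstration of why the log scale matters:

```python
import math

# 400 tokens, each with probability 0.01: the product underflows float64,
# while the sum of logprobs stays a perfectly ordinary number
probs = [0.01] * 400
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflowed to zero

logprob = sum(math.log(p) for p in probs)
print(logprob)  # about -1842.1
```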
### Stopping conditions
You also need to decide WHEN to stop generating:
- Max tokens: hard cutoff (risk: cuts mid-sentence)
- End-of-sequence token: the model emits `<EOS>`, we stop
- Stop strings: stop if the model emits a specific string (e.g., `"\nUser:"` for dialogue turn separation)
- Grammar-bound stopping: for structured outputs, stop when the grammar says the output is complete
## Relevance today (2026)
Huyen's explanation of temperature, top-k, top-p is timeless. What shifted since Dec 2024:
### 1. Temperature 0 is not deterministic (the hard truth)
Teams assumed T=0 meant reproducible output. Not quite:
- Same model, same prompt, T=0 can give different outputs across requests
- Causes: GPU non-determinism (floating point ordering), batched-inference effects, KV cache interaction
- OpenAI and Anthropic do not guarantee bit-exact determinism even at T=0
- To get reproducibility you also need: same hardware, same batch size, same softmax implementation, and ideally a `seed` parameter
Lesson: T=0 gives consistency, not determinism. Test for both.
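The floating-point part is easy to demonstrate: addition is not associative, so a GPU kernel that reduces the same logits in a different order (e.g., because the batch size changed) can produce slightly different numbers:

```python
# floating-point addition is not associative
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 0.0 + 1.0 -> 1.0
right = a + (b + c)   # the 1.0 is absorbed into -1e16 before a cancels it
print(left, right)    # 1.0 0.0
```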
### 2. Logprobs are sometimes hidden
Huyen mentions limited logprobs APIs. In 2026:
- OpenAI exposes top-20 logprobs
- Anthropic exposes none at time of writing
- Open-weight models (vLLM, SGLang) expose everything
- Security reason: logprobs leak model internals, making reverse-engineering easier
If you need logprobs for classification or calibration, check API access first.
### 3. Reasoning models change the game
OpenAI o1, Claude Opus 4.5 thinking, DeepSeek R1 separate "thinking" tokens from "response" tokens. Temperature settings often apply differently to each phase. Some APIs hide thinking tokens entirely.
See test time compute for the bigger picture.
### 4. Structured output changed sampling too
Tools like Outlines, Guidance, xgrammar implement constrained sampling: filter logits to only allow tokens that satisfy a grammar (JSON schema, regex, BNF). This is orthogonal to temperature/top-k/top-p but layered on top.
See structured outputs.
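Underneath, these tools all do a variant of the same thing: mask disallowed logits to -inf before softmax. A minimal sketch (in practice the allowed set would come from a grammar engine, not a hand-written set):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrain(logits, allowed_ids):
    # tokens outside the grammar get -inf, i.e. probability exactly 0 after softmax
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

probs = softmax(constrain([1.0, 2.0, 3.0], allowed_ids={0, 2}))
print([round(p, 2) for p in probs])  # [0.12, 0.0, 0.88]
```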
### 5. New heuristics like DRY and XTC
Open-weight communities experiment with:
- DRY (Don't Repeat Yourself): penalizes recently-repeated phrases
- XTC (Exclude Top Choices): flips the usual logic, excludes the most likely tokens to force creativity
- Mirostat / Typical Sampling: information-theoretic approaches
None are standard. But worth knowing they exist if you run your own models (vLLM, llama.cpp support some).
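As a flavor of how different these are from classic truncation, here is a hedged sketch of XTC; parameter names follow common open-weight sampler conventions, but check your backend's docs:

```python
import random

def xtc_filter(probs, threshold=0.1, activation_prob=0.5, rng=random):
    # with probability activation_prob, remove every token at or above
    # `threshold` EXCEPT the least likely of them, forcing a less obvious pick
    if rng.random() >= activation_prob:
        return list(range(len(probs)))       # not activated this step
    top = [i for i, p in enumerate(probs) if p >= threshold]
    if len(top) < 2:
        return list(range(len(probs)))       # nothing worth excluding
    keep = min(top, key=lambda i: probs[i])  # spare the weakest "top choice"
    removed = set(top) - {keep}
    return [i for i in range(len(probs)) if i not in removed]

# activation_prob=1.0 forces the filter on for the demo
print(xtc_filter([0.5, 0.3, 0.15, 0.05], activation_prob=1.0))  # [2, 3]
```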
## Critical questions
- Why does T=0 not guarantee determinism? Which layer introduces the randomness?
- If top-p is more adaptive, why do many production systems still use top-k?
- What's the difference between lowering temperature to 0.3 vs setting top-p to 0.3? They feel similar but are not the same.
- Why do reasoning models often recommend specific temperatures (DeepSeek R1: 0.6) rather than 0?
- Logprobs are "security sensitive". How sensitive, really? What attack do they enable?
## Production pitfalls
- Assuming determinism at T=0. You can get different outputs for the same prompt. Cache responses if you need exact repeats.
- Expecting diversity at T=1. With a sharp distribution (a confident model), T=1 output still looks greedy. If you want diversity, you may need higher T plus top-p.
- Forgetting stop tokens in structured outputs. JSON generation with no max_tokens and a broken stop condition can burn 100K tokens of garbage.
- Mismatched tokenizers in logprob handling. Computing logprobs on tokens that differ from the model's internal tokenization gives nonsense.
- Temperature 0 for creative tasks. The model will pick the same boring opener every time. Use T=0.7+ for variety.
- Using temperature to "fix" hallucinations. Lower T does NOT reduce hallucinations, it just makes them consistent. See hallucinations.
## Alternatives / Comparisons
| Strategy | Knob | Good for | Avoid for |
|---|---|---|---|
| Greedy (T=0) | None | Classification, structured | Creative text |
| Temperature | T = 0.7-1.0 | General chat | Need consistency |
| Top-k | K = 50-100 | Balanced generation | Very short answers |
| Top-p (nucleus) | p = 0.9-0.95 | Adaptive sampling | Fine-grained control |
| Min-p | p = 0.05-0.1 | Truncate long tail | Main strategy in isolation |
| Constrained sampling | Grammar | Structured output (JSON, SQL) | Free-form text |
| Beam search | Beam width | Translation, exact answers | Chat, diversity |
Stack them: typical production config is temperature=0.7, top_p=0.95, top_k=100 together.
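A sketch of that stacked order (temperature, then top-k, then top-p, then sample), which mirrors how most serving stacks apply the knobs:

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=100, top_p=0.95, rng=random):
    # 1. temperature: rescale logits
    scaled = [x / temperature for x in logits]
    # 2. top-k: keep the k highest logits
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 3. top-p: truncate to the nucleus
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # 4. renormalize over the survivors and draw
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for idx, p in kept:
        r -= p
        if r <= 0:
            return idx
    return kept[-1][0]

# a very confident distribution collapses to greedy regardless of the rng
print(sample_next([0.0, 10.0, 0.0], temperature=0.5))  # 1
```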
## Mini-lab
labs/sampling-playground/ (to create) - run the same prompt through a local model (Phi-3 or Llama 3 via Ollama) with varied parameters. Visualize:
- Output diversity across 10 runs at each setting
- Token probability distributions at different temperatures
- Latency impact of different sampling configs
Stack: uv + transformers + a small plotting lib.
Goal: internalize the "feel" of each knob.
## Further reading
- Huyen, Chapter 2 - the chapter this note extends
- "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019) - original top-p paper
- OpenAI sampling parameters docs: https://platform.openai.com/docs/api-reference/chat
- "Watermarking" papers (Kirchenbauer et al., 2023+) - sampling-based model fingerprinting
- vLLM sampling parameters (for open-weight serving): https://docs.vllm.ai