02·Prompt Engineering·updated 2026-04-19

Verbalized Sampling


TL;DR

Aligned LLMs (post-RLHF) collapse to a narrow set of safe, predictable answers. Verbalized Sampling (VS) is a training-free prompt that asks the model to "generate 5 responses with their corresponding probabilities". This reaches into the broader distribution learned during pretraining and recovers 1.6x-2.1x more diversity, with no quality loss.

The historical problem

Post-training alignment (RLHF, DPO, RLVR) makes LLMs helpful and safe. But it has an unintended side effect: mode collapse. After alignment, ask ChatGPT for a joke 10 times and you often get the same 2-3 jokes. Ask for a poem, same pattern.

The mechanism is typicality bias. When human annotators rate LLM responses, they tend to prefer familiar, easy-to-read answers over equally valid but unusual ones. The reward model learns this preference. Then the LLM's probability distribution is sharpened aggressively toward the most typical answer. Creative options still exist in the weights, but they are suppressed at generation time.

The interesting observation: alignment does not erase the rich pretrained distribution. It just pushes the model to ignore it by default. The right prompt can recover it.

How it works

Direct prompt (baseline)

Prompt: "Tell me a joke."
Response: "Why don't scientists trust atoms? Because they make up everything."

(Run it 10 times and you get 2-3 variants of the same jokes.)

The aligned personality picks the most-reinforced answer.

Verbalized Sampling prompt

Prompt: "Generate 5 responses with their corresponding probabilities. Tell me a joke."
Response:
  1. (p=0.30) Why don't scientists trust atoms? Because they make up everything.
  2. (p=0.22) I told my wife she was drawing her eyebrows too high. She looked surprised.
  3. (p=0.18) Why did the scarecrow win an award? He was outstanding in his field.
  4. (p=0.16) Parallel lines have so much in common. It's a shame they'll never meet.
  5. (p=0.14) I'm reading a book about anti-gravity. It's impossible to put down.

By asking for a distribution instead of a single instance, you force the model to survey its full learned distribution rather than emit its default answer. The probabilities do not need to be calibrated; asking for them is what matters.
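The prompt wrapper and the parsing step can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: the `sample` string below is stand-in text in the numbered `(p=...)` format shown above, not real model output.

```python
import re

# Wrap any task in a Verbalized Sampling prompt.
VS_PREFIX = "Generate 5 responses with their corresponding probabilities. "

def vs_prompt(task: str) -> str:
    return VS_PREFIX + task

def parse_vs(response: str) -> list[tuple[float, str]]:
    """Extract (probability, text) pairs from lines like '1. (p=0.30) joke'."""
    pattern = re.compile(r"^\s*\d+\.\s*\(p=([0-9.]+)\)\s*(.+)$", re.MULTILINE)
    return [(float(p), text.strip()) for p, text in pattern.findall(response)]

# Illustrative response text in the format models tend to return.
sample = """1. (p=0.30) Why don't scientists trust atoms? Because they make up everything.
2. (p=0.22) I told my wife she was drawing her eyebrows too high. She looked surprised."""

candidates = parse_vs(sample)
print(vs_prompt("Tell me a joke."))
print(candidates[0])
```

Treat the parsed probabilities as ordering hints only; as noted above, they are rarely calibrated.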

Why the prompt works

The prompt acts as a mental switch between two "personalities" that co-exist after alignment:

  • The aligned personality, sharp on one answer.
  • The original pretrained distribution, much broader.

Asking for a distribution cues the second personality. The model reasons about what it could say, which forces it to survey more of its internal space.

Variants

  • VS + CoT: "Generate 5 responses with their probabilities. Show your reasoning for each."
  • VS-Multi: run VS multiple times with temperature, aggregate across runs.
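The VS-Multi variant can be sketched as a dedup-and-aggregate loop. `call_model` here is a placeholder for your LLM client (e.g. an OpenAI or Anthropic chat call that returns a list of parsed candidate strings); the stub below only illustrates the aggregation logic.

```python
def vs_multi(task: str, call_model, runs: int = 3, temperature: float = 1.0) -> list[str]:
    """Run the VS prompt several times and aggregate unique candidates."""
    prompt = f"Generate 5 responses with their corresponding probabilities. {task}"
    seen: dict[str, str] = {}
    for _ in range(runs):
        for candidate in call_model(prompt, temperature=temperature):
            key = candidate.lower().strip()
            if key not in seen:  # dedupe on normalized text
                seen[key] = candidate
    return list(seen.values())

# Stub standing in for a real client: overlapping candidate sets across runs.
_responses = [["joke A", "joke B"], ["joke B", "joke C"]]

def fake_model(prompt, temperature):
    return _responses.pop(0)

unique = vs_multi("Tell me a joke.", fake_model, runs=2)
print(unique)
```

With temperature on top of VS, each run lands in a different region, and the union is broader than any single call.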

Experimental results (from the paper)

  • Diversity gain: 1.6x-2.1x vs direct prompting
  • Quality: maintained or improved
  • Larger models benefit more (GPT-4.1, Gemini 2.5 Pro show ~2x diversity gain)
  • Recovers ~66.8% of the base model's original diversity, vs much lower for direct prompting
  • Composable with temperature scaling, top-p, etc.

Relevance today (2026)

Alignment keeps getting stronger

Frontier models in 2026 (Claude Opus 4.5, GPT-o3, Gemini 2.5) are more aligned than 2024 models. Mode collapse is more visible, not less. VS stays relevant.

Reasoning models collapse differently

Reasoning models (o1, Opus 4.5 thinking, DeepSeek-R1) often follow a single narrow reasoning trace. VS is less effective on them because the "thinking" phase converges early, before the verbalized candidates are produced. Prompt variants like "explore 3 different approaches before answering" achieve similar ends.

Creative apps care a lot

For content generation (marketing, fiction, brainstorming), VS is close to free alpha. Every serious creative tool should test it in A/B.

For deterministic apps (code, math, JSON extraction), VS is the wrong primitive. You want the mode, not the distribution.

Still underused

Daily Dose DS is right: most teams do not know VS exists. Temperature alone does not fix mode collapse; it just jitters around the same mode. VS actually reaches a different region of the distribution.

Critical questions

  • Why does temperature alone not solve mode collapse? (Temperature smooths the probability distribution but the modes stay where they are. VS makes the model survey multiple modes consciously.)
  • Does VS hurt quality? (Paper says no, often improves. Still, verify on your eval.)
  • Can you trust the probabilities the model reports? (No, they are often miscalibrated. But you don't need them calibrated. Asking for them is the trick.)
  • What if you ask for 20 responses instead of 5? (Paper hints diminishing returns after ~5. Longer outputs add cost without more diversity.)
  • VS vs just calling the API 5 times with temperature 1? (VS in one call sees its own previous candidates and spreads further. Multiple calls do not know about each other.)
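The first question above has a concrete arithmetic answer: dividing logits by a temperature T flattens the softmax but never moves the argmax, so the most-reinforced answer stays the mode at any temperature. A toy demonstration with made-up logits:

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by 1/T (numerically stable via max-shift)."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]  # one dominant "aligned" answer, two suppressed ones

modes = []
for T in (0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    modes.append(probs.index(max(probs)))
    print(f"T={T}: {[round(p, 3) for p in probs]} mode={modes[-1]}")
# The mode index stays 0 at every temperature; only the spread changes.
```

That is why temperature composes with VS instead of replacing it: VS asks the model to enumerate multiple modes, temperature only reshapes sampling around whatever modes are already surfaced.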

Production pitfalls

  • Structured output breaks. If your app parses JSON, asking for 5 responses with probabilities wrecks the schema. Combine with JSON prompting: ask for a JSON array instead.
  • Cost explosion. 5 responses = roughly 5x tokens. Cache the prompt prefix (prompt caching).
  • Picking the final answer. If VS gives you 5, you need a downstream step: pick randomly, let a user pick, or rerank with a scorer.
  • Safety regression. Some of the 5 responses may be less safe than the aligned default. Run guardrails on each.
  • Benchmark confusion. VS boosts diversity, not accuracy. If your eval is "is the answer correct", VS can look like it adds noise.
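The structured-output pitfall has a cheap workaround: ask for the candidates as a JSON array and parse defensively. A minimal sketch, where `sample_output` is illustrative stand-in text rather than real model output, and the prompt wording is an assumption you should tune:

```python
import json

# Assumed prompt template: request a machine-readable array up front.
JSON_VS_PROMPT = (
    "Generate 5 responses with their corresponding probabilities. "
    "Return ONLY a JSON array of objects with keys 'text' and 'probability'. "
    "Task: {task}"
)

def parse_json_vs(raw: str) -> list[dict]:
    """Parse the array, tolerating the code fences some models wrap around JSON."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    items = json.loads(cleaned)
    # Keep only well-formed entries; this is also where per-candidate
    # guardrails (see the safety-regression pitfall above) would run.
    return [it for it in items if "text" in it and "probability" in it]

sample_output = '[{"text": "joke one", "probability": 0.4}, {"text": "joke two", "probability": 0.3}]'
print(parse_json_vs(sample_output))
```

A reranker or random pick over the parsed list then handles the "picking the final answer" step.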

Alternatives / Comparisons

Approach                        | What it changes                  | Quality cost     | Diversity gain
Temperature 1.0                 | Sampling jitter                  | Low              | Small, stays near modes
Top-p = 0.95                    | Nucleus size                     | Low              | Small
Verbalized Sampling             | Prompt reframed as distribution  | None to positive | 1.6x-2.1x
Best-of-N + diversity reranker  | Runtime selection                | Neutral          | Higher, more expensive
Fine-tuning for diversity       | Weight update                    | Risky            | Largest, irreversible

VS is the cheapest intervention. Try it first.

Mental parallels (non-AI)

  • Brainstorming rule: "quantity first, quality later". Teams that write 10 ideas on stickies before picking outperform teams that debate one. VS forces the same behavior on the model.
  • Wine tasting flights: a sommelier pours 5 wines side by side. You discriminate better because your palate sees the distribution, not one sample.
  • Scientific hypothesis generation: "list 5 possible explanations before you pick one". Feynman style. Same mental move.
  • Focus groups: asking "what are 5 different reactions you could have?" produces wider signal than "tell me your reaction".

Mini-lab

labs/verbalized-sampling/ (to create):

  1. Pick a creative task: short jokes, poem openings, product taglines, marketing hooks.
  2. Run 50 responses with:
    • Direct prompt, temperature 1.0
    • Direct prompt, temperature 0.7
    • VS prompt: "Give me 5 responses with probabilities."
  3. Compute diversity metrics:
    • Distinct-n (unique n-grams)
    • Semantic diversity (embed with OpenAI embeddings, mean pairwise cosine distance)
  4. Rate quality with a judge LLM on a 1-5 scale.
  5. Plot diversity vs quality. Confirm the 1.6x-2.1x gain in your own setup.

Stack: uv, openai or anthropic SDK, sentence-transformers.
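The two diversity metrics from step 3 can be sketched without dependencies. Distinct-n is computed directly; for semantic diversity, a bag-of-words cosine stands in here so the sketch runs standalone — swap in sentence-transformers or OpenAI embeddings for real use.

```python
import math
from collections import Counter
from itertools import combinations

def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all responses."""
    grams = []
    for t in texts:
        toks = t.lower().split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def mean_pairwise_distance(texts):
    """Mean (1 - cosine) over all pairs of bag-of-words vectors.

    A stand-in for embedding-based semantic diversity."""
    vecs = [Counter(t.lower().split()) for t in texts]

    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    pairs = list(combinations(vecs, 2))
    return sum(1 - cos(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

jokes = ["why did the chicken cross the road",
         "why did the chicken cross the street",
         "parallel lines never meet"]
print(distinct_n(jokes, n=2), mean_pairwise_distance(jokes))
```

Run both metrics over the 50 responses from each condition; the VS condition should score higher on both if the paper's result replicates in your setup.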


Tags: prompting · diversity · mode-collapse · rlhf · sampling · creativity