Hallucinations
TL;DR
A hallucination is when a model generates content that is not grounded in facts. Two dominant hypotheses: (1) self-delusion (the model cannot distinguish its own output from given facts), (2) knowledge mismatch (SFT teaches the model to mimic answers that use facts the model itself does not know). Mitigation: retrieval, citations, reward models that penalize fabrication.
The historical problem
Hallucination predates LLMs: Goyal et al. (2016) were already discussing it in neural text generation. But it exploded in prominence with ChatGPT because:
- Models became confident in natural language (not just "probability 0.7")
- Users asked factual questions en masse
- Failures were public and consequential
Famous incident: in 2023, lawyers submitted a court brief with fictitious case citations generated by ChatGPT (Mata v. Avianca); the court sanctioned and fined them in June 2023. Lawyers now face a whole new liability category.
Hallucinations are the main reason factual tasks (legal, medical, financial) still need human oversight when using LLMs.
How it works
Inconsistency vs hallucination
Huyen distinguishes two related issues:
| Issue | Definition | Cause |
|---|---|---|
| Inconsistency | Same prompt, different outputs | Sampling randomness |
| Hallucination | Output not grounded in facts | Deeper, see below |
Inconsistency can be reduced with low temperature and caching. Hallucination is harder.
Hypothesis 1: self-delusion (DeepMind, Ortega et al., 2021)
A language model cannot tell apart:
- Data it is given in the prompt (truth from the user)
- Data it generated itself (its own prior output)
Example: You ask "Who is Chip Huyen?" The model starts: "Chip Huyen is an architect." (wrong, she is an AI researcher/author).
Now the model sees the sequence: "Who is Chip Huyen? Chip Huyen is an architect."
It treats "Chip Huyen is an architect" as input, not as something it just made up. The next tokens it generates are conditioned on this false premise. It elaborates: "She designed the Beijing airport terminal..." etc. Snowballing.
Zhang et al. (2023) call this snowballing hallucinations. Once a model commits to a false fact, subsequent output builds on it and the errors compound.
Hypothesis 2: knowledge mismatch (Leo Gao, OpenAI)
SFT teaches the model to mimic labeler responses. Labelers write answers using their own knowledge.
Example: a labeler writes "Chip Huyen runs a successful AI consulting firm called Claypot AI." The labeler knows this (say, from LinkedIn). The base model never saw this in training. But SFT teaches the model to produce similar fluent answers, even when it does not KNOW them.
Net effect: the model learns the SHAPE of confident answers, not the FACTS. It will confidently make up plausible-sounding information because that is what its training pushed it to produce.
John Schulman (OpenAI) made a deeper claim: LLMs know if they know something. If true, you can train them to say "I don't know" when uncertain. Schulman proposed two fixes:
- Verification: ask the model to cite sources for each claim
- Better reward function: punish fabrication specifically during RLHF
OpenAI's InstructGPT paper showed RLHF actually made hallucinations WORSE in some experiments. But humans still preferred RLHF outputs overall. So there are tradeoffs.
Why temperature does not fix it
A common misconception: "lower temperature = less hallucination". FALSE.
Temperature affects randomness in sampling, not factuality. If the model's most likely continuation is a fabrication, greedy decoding at T=0 emits that same fabrication every time; higher temperatures merely add variety around it. Determinism is not truth.
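A minimal sketch of why: temperature rescales the next-token distribution but never changes which token has the highest probability, so greedy decoding (T=0) always picks the same, possibly fabricated, continuation. The vocabulary and logits here are made up.

```python
import math

def next_token_probs(logits, temperature):
    # Softmax over logits / T; T=0 degenerates to argmax (greedy decoding).
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=logits.__getitem__)] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits where the model has internalized a false fact:
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = [1.0, 2.5, 0.5]  # "Sydney" (wrong) has the highest logit

for t in (0.0, 0.5, 1.0):
    probs = next_token_probs(logits, t)
    top = vocab[max(range(len(probs)), key=probs.__getitem__)]
    print(f"T={t}: mode={top}")  # the mode is "Sydney" at every temperature
```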
Relevance today (2026)
Huyen's two hypotheses are still the main framework in 2026. Updates:
1. RAG is the dominant mitigation
By 2026, retrieval-augmented generation is the default for factual use cases. Instead of asking the model to recall facts, you retrieve them from a trusted source and inject them into the prompt.
Details in the chunking and vector stores notes.
When done well:
- Claude with citations (Anthropic 2024)
- OpenAI search with source attribution (GPT-4 with search)
- Perplexity (search + cited LLM)
All report major hallucination reduction on factual tasks when retrieval is reliable.
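The retrieve-then-inject pattern can be sketched in a few lines. Everything here is illustrative: the document store, the word-overlap scorer (a stand-in for an embedding index), and the prompt template.

```python
import re

# Toy trusted corpus standing in for a real document store.
DOCS = [
    "Chip Huyen is an AI researcher and author of the book AI Engineering.",
    "Canberra is the capital of Australia.",
    "SelfCheckGPT detects hallucinations via sampling consistency.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    # Rank docs by word overlap with the query (stand-in for a vector store).
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]

def build_prompt(query):
    # Inject the retrieved facts and instruct the model to stay grounded.
    context = "\n".join(retrieve(query, DOCS))
    return (
        "Answer ONLY from the context below. If the context does not "
        "contain the answer, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("Who is Chip Huyen?"))
```

If `retrieve()` returns garbage, the model confidently grounds itself in garbage, so retrieval quality caps the whole pipeline.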
2. Reasoning models hallucinate less (and sometimes more)
Reasoning models (o1, R1, Claude Opus 4.5 thinking) hallucinate less on math and code because they can verify during thinking.
But they can hallucinate MORE on general knowledge if their internal "thinking" strays into confabulation. Anthropic's own research (2025) showed thinking-mode outputs need extra calibration.
3. Constitutional AI and self-critique
Anthropic's Constitutional AI has the model critique and revise its own outputs against a written constitution. Reduces some hallucinations (especially refusal-related) but not purely factual ones.
Similar self-critique pipelines (from Anthropic, OpenAI, Google): generate answer -> critique -> revise. Catches some errors, misses others.
4. Calibration and confidence signals
Some models now surface uncertainty:
- OpenAI's `response.choices[0].logprobs` lets you compute confidence scores
- Structured outputs that force the model to declare "confidence: low/medium/high"
- Claude's willingness to say "I'm not sure" when prompted correctly
But this is brittle. A confidently hallucinated fact still scores high confidence.
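As a sketch of the logprobs route: per-token logprobs (the shape the OpenAI chat completions API exposes) can be collapsed into one score. The numbers are invented, and the comparison deliberately shows the brittleness noted above: fluency scores high regardless of truth.

```python
import math

def answer_confidence(token_logprobs):
    # Geometric-mean token probability: exp(mean logprob). High values mean
    # the model found its own wording likely -- NOT that the answer is true.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token logprobs for two answers.
fluent_fabrication = [-0.1, -0.2, -0.1, -0.3]  # confident, wrong
hedged_truth = [-1.2, -0.9, -1.5, -1.1]        # uncertain, right

print(round(answer_confidence(fluent_fabrication), 2))  # 0.84
print(round(answer_confidence(hedged_truth), 2))        # 0.31
```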
5. Detection is an open problem
Huyen says detection is hard. In 2026:
- Chain-of-Verification (Dhuliawala et al., Meta AI, 2023): ask the model to list facts, verify each
- SelfCheckGPT (Manakul et al., 2023): sample N responses, check internal consistency
- Retrieval-based fact-checking: after generation, retrieve sources and check claims
- Specialized classifiers: HHEM (Vectara's Hallucination Evaluation Model) scores factuality
None are perfect. In high-stakes apps (legal, medical), human review is still required.
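SelfCheckGPT's core intuition is easy to sketch: if the model knows a fact, independent samples agree; if it is confabulating, they diverge. The real method scores agreement with NLI or BERTScore; plain Jaccard word overlap stands in here, and the sample answers are invented.

```python
import re
from itertools import combinations

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def self_consistency(samples):
    # Mean pairwise Jaccard similarity across N sampled answers, in [0, 1].
    pairs = list(combinations(samples, 2))
    sims = [len(tokens(a) & tokens(b)) / len(tokens(a) | tokens(b)) for a, b in pairs]
    return sum(sims) / len(sims)

known = ["Canberra is the capital of Australia."] * 3
hallucinated = [
    "Chip Huyen designed the Beijing airport terminal.",
    "Chip Huyen is a famous architect in New York.",
    "Chip Huyen built several museums in Europe.",
]
print(self_consistency(known))         # 1.0 -- identical samples
print(self_consistency(hallucinated))  # low -- fabrications diverge
```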
6. The Chip Huyen self-demo
Notable: Huyen uses her own name as an example. Asking any model "Who is Chip Huyen?" shows mixed results. Sometimes it gets her right (author, Stanford, etc.), sometimes it fabricates wildly. The example still holds as of 2026.
Critical questions
- If retrieval fixes hallucinations, why do we still need models to have world knowledge? Why not just always retrieve?
- The two hypotheses (self-delusion, knowledge mismatch) seem contradictory in their fixes. Do they describe the same phenomenon from different angles?
- RLHF can make hallucinations WORSE (InstructGPT paper). Why ship RLHF anyway?
- Can a model be trained to ALWAYS say "I don't know" when uncertain? What would that cost in helpfulness?
- For Torah Study AI: which hypothesis matters more? (Answer: both, since the model must refuse to fabricate rabbinical sources.)
Production pitfalls
- Assuming temperature=0 removes hallucination. It does not. See sampling and temperature.
- Trusting confident outputs. The model's tone is a poor signal of correctness. "The capital of Australia is Sydney" (wrong) reads exactly as confident as "The capital of Australia is Canberra" (correct).
- RAG without citation verification. Retrieving garbage and injecting it into the prompt yields garbage outputs with a false sense of grounding.
- Self-check loops that agree with themselves. Asking the same model to verify its own output often repeats the same hallucination. Use a different model or a retrieval step.
- Demo-to-prod gap in factual apps. LLMs hallucinate rarely on common queries (they memorized those). They hallucinate freely on rare or niche queries. Demos skew to common. Real user traffic breaks the illusion.
- No refusal training for your domain. The model will answer a medical question even if unqualified. Add domain-specific refusal guidance in SFT or system prompt.
Alternatives / Comparisons
| Mitigation | Effectiveness | Cost | When to use |
|---|---|---|---|
| RAG with trusted sources | High for factual | Medium (retrieval infra) | Most factual apps |
| Citation requirement in prompt | Medium | Free | Default practice |
| Chain-of-Verification | Medium-high | 2-3x tokens | Verified research, high-stakes |
| Constitutional AI / self-critique | Medium for safety, low for facts | 2-5x tokens | Alignment-critical apps |
| Temperature = 0 | Marginal | Free | Consistency (not factuality) |
| Lower token budget | Low | Saves money | Reduces opportunity to stray |
| Fine-tune on domain | High for narrow domains | Expensive | Legal, medical, finance |
| Human review | Highest | Expensive | High-stakes decisions |
Mini-lab
labs/hallucination-detection/ (to create) - build a simple hallucination detector:
- Assemble a bank of factual questions
- For each question, sample 5 responses from the model (optionally at varying temperatures)
- Measure self-consistency (do they agree?)
- Compare inconsistent answers to a ground truth
- Plot: self-inconsistency correlates with factual error?
Goal: see SelfCheckGPT's intuition with your own eyes.
Stack: uv + a local model via Ollama + a question bank (e.g., TriviaQA subset).
Further reading
- "Shaking the foundations: delusions in sequence models for interaction and control" (Ortega et al., DeepMind, 2021) - self-delusion hypothesis
- "How Language Model Hallucinations Can Snowball" (Zhang et al., 2023)
- John Schulman, UC Berkeley talk (2023) - knowledge-mismatch hypothesis
- "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" (Manakul et al., 2023)
- Chain-of-Verification (CoVe) paper (Dhuliawala et al., Meta AI, 2023)
- Anthropic Constitutional AI (Bai et al., 2022)
- Vectara HHEM leaderboard - ongoing comparison of hallucination rates per model
- Huyen, Chapter 2 - the version this note extends; Chapter 4 covers detection