Hallucinations
TL;DR
A hallucination is when a model generates content that is not grounded in facts. Two dominant hypotheses: (1) self-delusion (the model cannot distinguish its own output from given facts), (2) knowledge mismatch (SFT teaches the model to mimic answers that use facts the model itself does not know). Mitigation: retrieval, citations, reward models that penalize fabrication.
The historical problem
Hallucination predates LLMs: Goyal et al. (2016) were already discussing it in neural text generation. But it exploded in prominence with ChatGPT because:
- Models became confident in natural language (not just "probability 0.7")
- Users asked factual questions en masse
- Failures were public and consequential
Famous incident: in 2023, lawyers submitted a court brief with fictitious case citations generated by ChatGPT (Mata v. Avianca); the court sanctioned and fined them in June 2023. Lawyers now face a whole new liability category.
Hallucinations are the main reason factual tasks (legal, medical, financial) still need human oversight when using LLMs.
How it works
Inconsistency vs hallucination
Huyen distinguishes two related issues:
| Issue | Definition | Cause |
|---|---|---|
| Inconsistency | Same prompt, different outputs | Sampling randomness |
| Hallucination | Output not grounded in facts | Deeper, see below |
Inconsistency can be reduced with low temperature and caching. Hallucination is harder.
Hypothesis 1: self-delusion (DeepMind, Ortega et al., 2021)
A language model cannot tell apart:
- Data it is given in the prompt (truth from the user)
- Data it generated itself (its own prior output)
Example: You ask "Who is Chip Huyen?" The model starts: "Chip Huyen is an architect." (wrong, she is an AI researcher/author).
Now the model sees the sequence: "Who is Chip Huyen? Chip Huyen is an architect."
It treats "Chip Huyen is an architect" as input, not as something it just made up. The next tokens it generates are conditioned on this false premise. It elaborates: "She designed the Beijing airport terminal..." etc. Snowballing.
Zhang et al. (2023) call this snowballing hallucinations. Once a model commits to a false fact, subsequent output builds on it and the errors compound.
Hypothesis 2: knowledge mismatch (Leo Gao, OpenAI)
SFT teaches the model to mimic labeler responses. Labelers write answers using their own knowledge.
Example: a labeler writes "Chip Huyen runs a successful AI consulting firm called Claypot AI." The labeler knows this (say, from LinkedIn). The base model never saw this in training. But SFT teaches the model to produce similar fluent answers, even when it does not KNOW them.
Net effect: the model learns the SHAPE of confident answers, not the FACTS. It will confidently make up plausible-sounding information because that is what its training pushed it to produce.
John Schulman (OpenAI) made a deeper claim: LLMs know if they know something. If true, you can train them to say "I don't know" when uncertain. Schulman proposed two fixes:
- Verification: ask the model to cite sources for each claim
- Better reward function: punish fabrication specifically during RLHF
OpenAI's InstructGPT paper showed RLHF actually made hallucinations WORSE in some experiments. But humans still preferred RLHF outputs overall. So there are tradeoffs.
Why temperature does not fix it
A common misconception: "lower temperature = less hallucination". FALSE.
Temperature affects randomness in sampling, not factuality. If the model's most likely continuation is a fabrication, greedy decoding at T=0 emits that same fabrication every time; higher temperatures merely add variety around it. Determinism is not truth.
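A minimal sketch of why: temperature rescales the next-token distribution but never changes which token has the highest probability, so greedy decoding (T=0) always picks the same, possibly fabricated, continuation. The vocabulary and logits here are made up.

```python
import math

def next_token_probs(logits, temperature):
    # Softmax over logits / T; T=0 degenerates to argmax (greedy decoding).
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=logits.__getitem__)] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits where the model has internalized a false fact:
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = [1.0, 2.5, 0.5]  # "Sydney" (wrong) has the highest logit

for t in (0.0, 0.5, 1.0):
    probs = next_token_probs(logits, t)
    top = vocab[max(range(len(probs)), key=probs.__getitem__)]
    print(f"T={t}: mode={top}")  # the mode is "Sydney" at every temperature
```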
Relevance today (2026)
Huyen's two hypotheses are still the main framework in 2026. Updates:
1. RAG is the dominant mitigation
By 2026, retrieval-augmented generation is the default for factual use cases. Instead of asking the model to recall facts, you retrieve them from a trusted source and inject them into the prompt.
Details in the chunking and vector stores notes.
When done well:
- Claude with citations (Anthropic 2024)
- OpenAI search with source attribution (GPT-4 with search)
- Perplexity (search + cited LLM)
All report major hallucination reduction on factual tasks when retrieval is reliable.
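The retrieve-then-inject pattern can be sketched in a few lines. Everything here is illustrative: the document store, the word-overlap scorer (a stand-in for an embedding index), and the prompt template.

```python
import re

# Toy trusted corpus standing in for a real document store.
DOCS = [
    "Chip Huyen is an AI researcher and author of the book AI Engineering.",
    "Canberra is the capital of Australia.",
    "SelfCheckGPT detects hallucinations via sampling consistency.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    # Rank docs by word overlap with the query (stand-in for a vector store).
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]

def build_prompt(query):
    # Inject the retrieved facts and instruct the model to stay grounded.
    context = "\n".join(retrieve(query, DOCS))
    return (
        "Answer ONLY from the context below. If the context does not "
        "contain the answer, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("Who is Chip Huyen?"))
```

If `retrieve()` returns garbage, the model confidently grounds itself in garbage, so retrieval quality caps the whole pipeline.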
2. Reasoning models hallucinate less (and sometimes more)
Reasoning models (o1, R1, Claude Opus 4.5 thinking) hallucinate less on math and code because they can verify during thinking.
But they can hallucinate MORE on general knowledge if their internal "thinking" strays into confabulation. Anthropic's own research (2025) showed thinking-mode outputs need extra calibration.
3. Constitutional AI and self-critique
Anthropic's Constitutional AI has the model critique and revise its own outputs against a written constitution. Reduces some hallucinations (especially refusal-related) but not purely factual ones.
Similar self-critique pipelines (from Anthropic, OpenAI, Google): generate answer -> critique -> revise. Catches some errors, misses others.
4. Calibration and confidence signals
Some models now surface uncertainty:
- OpenAI's `response.choices[0].logprobs` lets you compute confidence scores
- Structured outputs that force the model to declare "confidence: low/medium/high"
- Claude's willingness to say "I'm not sure" when prompted correctly
But this is brittle. A confidently hallucinated fact still scores high confidence.
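As a sketch of the logprobs route: per-token logprobs (the shape the OpenAI chat completions API exposes) can be collapsed into one score. The numbers are invented, and the comparison deliberately shows the brittleness noted above: fluency scores high regardless of truth.

```python
import math

def answer_confidence(token_logprobs):
    # Geometric-mean token probability: exp(mean logprob). High values mean
    # the model found its own wording likely -- NOT that the answer is true.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token logprobs for two answers.
fluent_fabrication = [-0.1, -0.2, -0.1, -0.3]  # confident, wrong
hedged_truth = [-1.2, -0.9, -1.5, -1.1]        # uncertain, right

print(round(answer_confidence(fluent_fabrication), 2))  # 0.84
print(round(answer_confidence(hedged_truth), 2))        # 0.31
```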
5. Detection is an open problem
Huyen says detection is hard. In 2026:
- Chain-of-Verification (Dhuliawala et al., Meta AI, 2023): ask the model to list facts, verify each
- SelfCheckGPT (Manakul et al., 2023): sample N responses, check internal consistency
- Retrieval-based fact-checking: after generation, retrieve sources and check claims
- Specialized classifiers: HHEM (Vectara's Hallucination Evaluation Model) scores factuality
None are perfect. In high-stakes apps (legal, medical), human review is still required.
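SelfCheckGPT's core intuition is easy to sketch: if the model knows a fact, independent samples agree; if it is confabulating, they diverge. The real method scores agreement with NLI or BERTScore; plain Jaccard word overlap stands in here, and the sample answers are invented.

```python
import re
from itertools import combinations

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def self_consistency(samples):
    # Mean pairwise Jaccard similarity across N sampled answers, in [0, 1].
    pairs = list(combinations(samples, 2))
    sims = [len(tokens(a) & tokens(b)) / len(tokens(a) | tokens(b)) for a, b in pairs]
    return sum(sims) / len(sims)

known = ["Canberra is the capital of Australia."] * 3
hallucinated = [
    "Chip Huyen designed the Beijing airport terminal.",
    "Chip Huyen is a famous architect in New York.",
    "Chip Huyen built several museums in Europe.",
]
print(self_consistency(known))         # 1.0 -- identical samples
print(self_consistency(hallucinated))  # low -- fabrications diverge
```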
6. The Chip Huyen self-demo
Notable: Huyen uses her own name as an example. Asking any model "Who is Chip Huyen?" shows mixed results. Sometimes it gets her right (author, Stanford, etc.), sometimes it fabricates wildly. The example still holds as of 2026.
Critical questions
- If retrieval fixes hallucinations, why do we still need models to have world knowledge? Why not just always retrieve?
- The two hypotheses (self-delusion, knowledge mismatch) seem contradictory in their fixes. Do they describe the same phenomenon from different angles?
- RLHF can make hallucinations WORSE (InstructGPT paper). Why ship RLHF anyway?
- Can a model be trained to ALWAYS say "I don't know" when uncertain? What would that cost in helpfulness?
- For Torah Study AI: which hypothesis matters more? (Answer: both, since the model must refuse to fabricate rabbinical sources.)
Production pitfalls
- Assuming temperature=0 removes hallucination. It does not. See sampling and temperature.
- Trusting confident outputs. The model's tone is a poor signal of correctness. "The capital of Australia is Sydney" (wrong) reads exactly as confident as "The capital of Australia is Canberra" (correct).
- RAG without citation verification. Retrieving garbage and injecting it into the prompt yields garbage outputs with a false sense of grounding.
- Self-check loops that agree with themselves. Asking the same model to verify its own output often repeats the same hallucination. Use a different model or a retrieval step.
- Demo-to-prod gap in factual apps. LLMs hallucinate rarely on common queries (they memorized those). They hallucinate freely on rare or niche queries. Demos skew to common. Real user traffic breaks the illusion.
- No refusal training for your domain. The model will answer a medical question even if unqualified. Add domain-specific refusal guidance in SFT or system prompt.
Alternatives / Comparisons
| Mitigation | Effectiveness | Cost | When to use |
|---|---|---|---|
| RAG with trusted sources | High for factual | Medium (retrieval infra) | Most factual apps |
| Citation requirement in prompt | Medium | Free | Default practice |
| Chain-of-Verification | Medium-high | 2-3x tokens | Verified research, high-stakes |
| Constitutional AI / self-critique | Medium for safety, low for facts | 2-5x tokens | Alignment-critical apps |
| Temperature = 0 | Marginal | Free | Consistency (not factuality) |
| Lower token budget | Low | Saves money | Reduces opportunity to stray |
| Fine-tune on domain | High for narrow domains | Expensive | Legal, medical, finance |
| Human review | Highest | Expensive | High-stakes decisions |
Mini-lab
labs/hallucination-detection/ (to create) - build a simple hallucination detector:
- Assemble a bank of factual questions
- For each question, sample 5 responses from the model (optionally at varying temperatures)
- Measure self-consistency (do they agree?)
- Compare inconsistent answers to a ground truth
- Plot: self-inconsistency correlates with factual error?
Goal: see SelfCheckGPT's intuition with your own eyes.
Stack: uv + a local model via Ollama + a question bank (e.g., TriviaQA subset).
Further reading
- "Shaking the foundations: delusions in sequence models for interaction and control" (Ortega et al., DeepMind, 2021) - self-delusion hypothesis
- "How Language Model Hallucinations Can Snowball" (Zhang et al., 2023)
- John Schulman, UC Berkeley talk (2023) - knowledge-mismatch hypothesis
- "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" (Manakul et al., 2023)
- Chain-of-Verification (CoVe) paper (Dhuliawala et al., Meta AI, 2023)
- Anthropic Constitutional AI (Bai et al., 2022)
- Vectara HHEM leaderboard - ongoing comparison of hallucination rates per model
- Huyen, Chapter 2 - the version this note extends; Chapter 4 covers detection