Why Do AI Models Hallucinate?


This is a summary of the Build Wiz AI podcast and of the OpenAI paper "Why Language Models Hallucinate."

https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf

  • Hallucinations Defined: Language models “hallucinate” when they confidently produce plausible but incorrect statements, especially when uncertain, instead of admitting they don’t know.

  • Statistical Origins: Hallucinations originate from statistical pressures in model training. Even with perfectly clean data, the objectives in pretraining (like cross-entropy minimization) naturally cause models to guess when uncertain, leading to errors.

  • Binary Classification Link: The paper reduces the problem to binary classification: generating valid text is statistically harder than verifying whether a given statement is valid, so generative error rates are at least twice the misclassification rate of the corresponding validity-checking task.

  • Singleton Rate: Rare facts (those appearing only once in the training data, "singletons") are a major source of hallucination. If 20% of facts of a given kind (say, birthdays) appear only once in the training data, a base model should be expected to hallucinate on at least roughly 20% of queries about such facts.

  • Poor Model Architecture: Errors can arise not only from missing information, but also from limitations in the model architecture—such as n-gram models failing to capture context or modern architectures failing to count letters correctly due to tokenization.

  • Persistent Errors Post-Training: Techniques like RLHF and similar post-training methods reduce some hallucinations, especially for harmful or widely known misinformation, but confident falsehoods persist because the underlying incentives in benchmarks reward guessing.

  • Benchmark Incentives: Most popular benchmarks and leaderboards (e.g., MMLU, GPQA) use binary scoring (right = 1 point, wrong/abstain = 0 points). This system incentivizes models to guess rather than admit uncertainty, which makes hallucinations more common.

  • Socio-Technical Solution: The authors propose that to curb hallucinations, the scoring of dominant benchmarks must be changed to penalize overconfident mistakes and reward calibrated, uncertainty-aware responses. Explicit confidence targets in evaluation instructions are recommended.

  • Other Contributing Factors: Computational difficulty, distribution shifts (out-of-domain queries), and errors in training data (“garbage in, garbage out”) can also produce hallucinations, but the main driver is misaligned evaluation incentives.

  • Broader Implications: Reliable AI requires not only improved architectures and training, but also rethinking how models are evaluated and incentivized, ensuring models are rewarded for accuracy and honest uncertainty.
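The guessing incentive described above can be made concrete with a little expected-value arithmetic. The sketch below is an illustration, not code from the paper: under binary scoring, answering with any nonzero confidence beats abstaining, while under the paper's suggested confidence-target scheme (answer only if you are more than t confident, with wrong answers penalized t/(1 − t) points) abstaining becomes the rational choice below the threshold.

```python
# Why binary scoring rewards guessing, and how a confidence target fixes it.
# Illustrative sketch; the t/(1-t) penalty follows the paper's suggestion of
# explicit confidence targets in evaluation instructions.

def expected_score(p: float, penalty: float) -> float:
    """Expected score of answering with probability p of being correct."""
    return p * 1.0 + (1.0 - p) * (-penalty)

def should_answer(p: float, t: float) -> bool:
    """Answer only if expected score beats abstaining (which scores 0)."""
    penalty = t / (1.0 - t)  # wrong answers cost t/(1-t) points
    return expected_score(p, penalty) > 0.0

# Binary scoring (t = 0, no penalty): guessing at 10% confidence still pays.
assert should_answer(0.10, t=0.0)
# With t = 0.75 (wrong answers cost 3 points): abstain at 10%, answer at 90%.
assert not should_answer(0.10, t=0.75)
assert should_answer(0.90, t=0.75)
```

Under this rule, a calibrated model that honestly abstains when unsure is no longer out-scored by one that always guesses, which is exactly the incentive shift the authors argue dominant benchmarks need.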

The podcast provides an in-depth explanation of why advanced AI language models sometimes produce confident but incorrect answers—"hallucinations"—and what can be done to address them. Here is a bullet-point summary of the main points:

  • AI hallucination: Large language models often generate plausible but factually wrong information, rather than admitting they don’t know something.

  • Exam analogy: LLMs behave like students taking an exam, guessing rather than leaving answers blank, because their evaluation systems reward any attempt—even an unsure one.

https://podcasts.apple.com/il/podcast/why-language-models-hallucinate/id1799918505?i=1000725417087

  • Statistical roots: Hallucination is fundamentally tied to the training process and the statistical objectives of language modeling, not just to poor data quality.

  • Intrinsic errors: Some mistakes happen even for simple facts directly stated in the prompt due to limitations in model architecture or in how information is represented during training.

  • Singleton rate: Facts that appear only once in training data (like a specific birthday) are easily forgotten or misrepresented by the model.

  • Evaluation incentives: Current scoring systems (one point for correct, zero for wrong or “I don’t know”) encourage models to guess rather than honestly abstain, promoting overconfident errors.

  • Solution proposed: The authors suggest changing evaluation metrics—penalize wrong answers more and reward admissions of uncertainty, using confidence-based scoring to encourage honesty.

  • Other contributors: Computational difficulty, unfamiliar or out-of-distribution prompts, and errors in training data also play a role, but reward structure is the key problem.

  • Retrieval augmentation: Connecting models to the internet helps, but doesn’t fix the root causes unless the scoring incentives change.

  • Implications: These insights matter because trusted, reliable AI is vital as it becomes more integrated into everyday life and decision-making.
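The singleton-rate idea from both summaries is easy to illustrate: count how many distinct facts appear exactly once in a corpus. This is a toy sketch (the "facts" list is made up, and real corpora are far messier), in the spirit of the Good–Turing missing-mass estimate the singleton bound resembles.

```python
from collections import Counter

# Toy corpus of "facts"; repetition count stands in for how often each fact
# appears in training data. Everything here is invented for illustration.
facts = [
    "paris-capital-france", "paris-capital-france", "paris-capital-france",
    "water-boils-100c", "water-boils-100c",
    "obscure-person-birthday-march-3",   # appears once: a singleton
    "obscure-town-founded-1847",         # singleton
    "rare-species-wingspan-12cm",        # singleton
]

counts = Counter(facts)
singletons = [fact for fact, c in counts.items() if c == 1]
singleton_rate = len(singletons) / len(counts)
print(f"singleton rate: {singleton_rate:.0%}")  # 3 of 5 distinct facts -> 60%
```

On this toy data, the bound would predict the model hallucinates on at least roughly 60% of queries about these singleton-style facts—which is why rare, once-seen facts like a specific person's birthday are flagged as a major hallucination source.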