In one line: hallucination — when an AI states something fluent, confident and wrong — is structural, not a bug. Across 37 models in 2026, measured rates ran from 15% to 52%. The Truth Score is our editorial 0–100 rating of how reliably each model sticks to the truth, and it feeds the Safety & Reliability factor in our scoring.
The Truth Score table
Higher is better — a 96 means very few hallucinations; a 62 means frequent fabrication. Scores are editorial, grounded in published hallucination research and leaderboards.
| Model | Truth Score | Hallucination tendency |
|---|---|---|
| Claude Sonnet 4.6 | 96 | Lowest — ~3% on factual benchmarks |
| Claude Fable 5 | 93 | Very low, slight reasoning trade-off |
| Claude Opus 4.8 | 92 | Very low |
| Claude Haiku 4.5 | 90 | Low |
| Perplexity Pro | 90 | Low — retrieval-grounded with citations |
| Gemini 3.1 Pro | 84 | Low-moderate |
| Microsoft Copilot | 83 | Low-moderate |
| GPT-5.4 | 82 | Moderate |
| GPT-4o | 80 | Moderate — ~16% on PersonQA |
| Gemini 3 Flash | 80 | Moderate |
| Grok 4.1 | 79 | Moderate (~12%, highest among the top commercial models) |
| GPT-5.5 | 78 | Moderate (reasoning trade-off) |
| Gemini 3.1 Flash-Lite | 74 | Higher |
| Llama 4 | 74 | Higher (open-weight) |
| DeepSeek V3 | 68 | Higher (open-weight) |
| o3 | 62 | Highest — ~33% on PersonQA |
Sources: Vectara hallucination leaderboard, provider research disclosures, and published benchmark studies (PersonQA, TruthfulQA). This is an editorial synthesis, not first-hand testing.
The confidence paradox: smarter isn't more honest
The most counter-intuitive finding of 2026: reasoning models can hallucinate more on factual questions, not less. OpenAI's o3 was measured hallucinating around 33% of the time on PersonQA, roughly double the ~16% of its predecessor o1. The smaller o4-mini was worse again at 48%. More deliberate reasoning can produce answers that are more elaborate, more confident, and sometimes further from the truth. A model that reasons its way to a fabrication is more dangerous than one that simply admits it doesn't know.
Why hallucination is structural
Language models predict the most likely next token, not the true one. When the training data is thin or absent, the model fills the gap with something plausible. A 2025 result showed this is mathematically inevitable for models built this way — hallucination can be reduced and detected, but not eliminated. Any vendor claiming "zero hallucinations" is overselling.
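A toy sketch of that mechanism, assuming a simple greedy decoder. The prompt and the probabilities are invented for illustration and do not come from any real model; `greedy_pick` is a hypothetical helper.

```python
# Hypothetical next-token probabilities for the prompt
# "The capital of Australia is". The numbers are made up for illustration.
next_token_probs = {
    "Sydney": 0.46,     # appears often near "Australia" in text, but is wrong
    "Canberra": 0.41,   # the correct answer
    "Melbourne": 0.13,
}


def greedy_pick(probs: dict[str, float]) -> str:
    """Return the highest-probability token, as greedy decoding would."""
    return max(probs, key=probs.get)


# The decoder optimises for likelihood, so it returns the plausible
# token even though a lower-ranked token is the true one.
print(greedy_pick(next_token_probs))
```

Nothing in the selection step checks truth: the only signal is probability, which is why a thin or skewed training signal surfaces as a fluent wrong answer.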
It varies enormously by domain
A single model's hallucination rate is not one number — it depends entirely on the task.
| Domain | Typical hallucination rate | Implication |
|---|---|---|
| Summarisation | <1.5% | Safe — grounded in supplied text |
| General factual Q&A | 15–33% | Verify anything that matters |
| Medical case summaries | 43–64% | Human clinical review mandatory |
| Legal queries | 58–88% | Never rely on un-checked citations |
The lesson for business: a model that is safe for summarising your meeting notes can be dangerous for drafting a contract clause. Match the model — and the review process — to the risk.
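One way to make that matching explicit is a small lookup from task domain to review policy. The sketch below uses the rate ranges from the table above; the thresholds (5% and 40%), the policy labels, and the function name are assumptions chosen for illustration, not part of any published method.

```python
# Typical hallucination-rate ranges in percent, taken from the table above.
# "<1.5%" for summarisation is represented as the range 0.0 to 1.5.
DOMAIN_RATES = {
    "summarisation": (0.0, 1.5),
    "general_qa": (15.0, 33.0),
    "medical_summary": (43.0, 64.0),
    "legal_query": (58.0, 88.0),
}


def review_policy(domain: str,
                  verify_above: float = 5.0,
                  mandatory_above: float = 40.0) -> str:
    """Map a domain to a review policy using the worst-case (upper) rate.

    Both thresholds are illustrative assumptions; tune them to your own risk.
    """
    _, worst_case = DOMAIN_RATES[domain]
    if worst_case >= mandatory_above:
        return "mandatory expert review"
    if worst_case >= verify_above:
        return "verify key facts"
    return "spot check"


for name in DOMAIN_RATES:
    print(f"{name}: {review_policy(name)}")
```

Keying the policy on the upper end of each range is a deliberately conservative choice: the review step is sized for the bad case, not the average one.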
The Confident Liar problem
Not all wrong answers are equal. A model that says "I'm not certain, but…" is far safer than one that fabricates with total confidence. We weight confident-wrong answers more harshly in the Truth Score than acknowledged uncertainty. A model that knows the edge of its knowledge is a safer business tool than a more capable model that bluffs.
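To show what weighting confident-wrong answers more harshly can look like, here is a minimal sketch. It is not the actual Truth Score formula, which is editorial; the outcome labels, the penalty weights, and `reliability_score` are all hypothetical.

```python
# Illustrative penalties per answer outcome (assumed values, not the
# real Truth Score weights). A confident fabrication costs the most.
PENALTY = {
    "correct": 0.0,
    "abstained": 0.25,        # "I don't know"
    "hedged_wrong": 0.5,      # wrong, but flagged as uncertain
    "confident_wrong": 1.0,   # wrong and stated as fact
}


def reliability_score(outcomes: list[str]) -> float:
    """Return a 0-100 score where higher means fewer and milder errors."""
    if not outcomes:
        raise ValueError("need at least one graded answer")
    total_penalty = sum(PENALTY[o] for o in outcomes)
    return round(100 * (1 - total_penalty / len(outcomes)), 1)


# A cautious model: 70% correct, but every miss is flagged as uncertain.
cautious = ["correct"] * 7 + ["hedged_wrong"] * 3
# A bluffing model: 80% correct, but every miss is stated confidently.
bluffer = ["correct"] * 8 + ["confident_wrong"] * 2

print(reliability_score(cautious))  # 85.0
print(reliability_score(bluffer))   # 80.0
```

Under these assumed weights the cautious model outscores the more accurate bluffer, which is the ordering the paragraph above argues for.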
How to reduce hallucinations
You can't remove them, but you can cut them dramatically. Ranked by measured impact:
| Technique | Reduction | When to use |
|---|---|---|
| Retrieval-augmented generation (RAG) | ~71% | Any factual or document-grounded task |
| Self-consistency checking | ~65% | High-stakes single answers |
| Ensemble checking (multi-model) | 30–50% | Critical decisions worth the extra cost |
| Prompt mitigation ("say if unsure") | ~22 percentage points | Cheap baseline on every prompt |
| Fine-tuning on domain data | Varies | Narrow, repeated, specialised tasks |
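As an illustration of the self-consistency row, the sketch below samples the same question several times and keeps an answer only when enough samples agree. This is one common reading of the technique, not necessarily the setup behind the figure in the table. The `ask` callable stands in for whatever model client you use, and `make_stub`, `samples`, and `min_agreement` are hypothetical names and defaults.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(ask: Callable[[str], str],
                           question: str,
                           samples: int = 5,
                           min_agreement: float = 0.6) -> Optional[str]:
    """Ask the same question `samples` times and return the majority answer.

    Returns None when no answer reaches `min_agreement`, which the caller
    should treat as "escalate to a human" rather than as an answer.
    """
    answers = [ask(question).strip().lower() for _ in range(samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer if count / samples >= min_agreement else None


def make_stub(replies: list[str]) -> Callable[[str], str]:
    """Build a fake model that returns canned replies, for demonstration."""
    it = iter(replies)
    return lambda _question: next(it)


stable = make_stub(["Paris", "paris ", "Paris", "Lyon", "Paris"])
print(self_consistent_answer(stable, "Capital of France?"))  # paris
```

Exact string matching only works for short, canonical answers; longer free-text answers would need a normalisation or similarity step, which is omitted here.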
The single most effective move is RAG: ground the model in your own verified sources so it retrieves rather than invents. Combined with a human review step on anything that matters, it turns an unreliable generalist into a dependable business tool.
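The grounding step can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: retrieval is plain word overlap where a production system would typically use embeddings, the documents are invented examples, and the function names are hypothetical. The resulting prompt would then be sent to the model of your choice.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents sharing the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(documents,
                    key=lambda d: len(q_tokens & tokenize(d)),
                    reverse=True)
    # Drop documents with no overlap at all so irrelevant text is not cited.
    return [d for d in ranked[:k] if q_tokens & tokenize(d)]


def build_grounded_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Build a prompt that restricts the model to the retrieved passages."""
    passages = retrieve(question, documents, k)
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say you are not sure.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_grounded_prompt("Are refunds accepted after purchase?", docs))
```

The prompt also carries the "say if unsure" instruction from the table, so the cheap baseline and the retrieval step are applied together.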
What changed in June 2026
- The 37-model benchmark put measured hallucination at 15–52%, confirming the problem is universal, not vendor-specific.
- Reasoning models (o3, o4-mini) were shown to hallucinate more on facts — reframing the debate from "can we fix it" to "which model lies least".
- Claude Sonnet 4.6 held the lowest measured rate among frontier models, at roughly 3% on factual benchmarks.
Choosing a model where accuracy is critical? Use the match engine and weight safety highly, or compare every model's full scorecard on the comparison table.