The Truth Score · Updated June 2026

Which AI tells the truth?

The Truth Score rates each model on how rarely it hallucinates. Claude Sonnet 4.6 leads at roughly a 3% hallucination rate. The surprise: models built to reason harder often hallucinate more on facts, not less.

In one line: hallucination — when an AI states something fluent, confident and wrong — is structural, not a bug. Across 37 models in 2026, measured rates ran from 15% to 52%. The Truth Score is our editorial 0–100 rating of how reliably each model sticks to the truth, and it feeds the Safety & Reliability factor in our scoring.

The Truth Score table

Higher is better — a 96 means very few hallucinations; a 62 means frequent fabrication. Scores are editorial, grounded in published hallucination research and leaderboards.

| Model | Truth Score | Hallucination tendency |
|---|---|---|
| Claude Sonnet 4.6 | 96 | Lowest: ~3% on factual benchmarks |
| Claude Fable 5 | 93 | Very low, slight reasoning trade-off |
| Claude Opus 4.8 | 92 | Very low |
| Claude Haiku 4.5 | 90 | Low |
| Perplexity Pro | 90 | Low: retrieval-grounded with citations |
| Gemini 3.1 Pro | 84 | Low-moderate |
| Microsoft Copilot | 83 | Low-moderate |
| GPT-5.4 | 82 | Moderate |
| GPT-4o | 80 | Moderate: ~16% on PersonQA |
| Gemini 3 Flash | 80 | Moderate |
| Grok 4.1 | 79 | Moderate: ~12%, highest among top commercial |
| GPT-5.5 | 78 | Moderate (reasoning trade-off) |
| Gemini 3.1 Flash-Lite | 74 | Higher |
| Llama 4 | 74 | Higher (open-weight) |
| DeepSeek V3 | 68 | Higher (open-weight) |
| o3 | 62 | Highest: ~33% on PersonQA |

Sources: Vectara hallucination leaderboard, provider research disclosures, and published benchmark studies (PersonQA, TruthfulQA). Editorial synthesis — not a first-person test.

The confidence paradox: smarter isn't more honest

The most counter-intuitive finding of 2026: reasoning models hallucinate more on factual questions. OpenAI's o3 was measured hallucinating around 33% of the time on PersonQA — more than double the ~16% of its predecessor o1. The smaller o4-mini was worse again at 48%. More deliberate reasoning produces more elaborate, more confident, and sometimes more wrong answers. A model that reasons its way to a fabrication is more dangerous than one that simply admits it doesn't know.

Why hallucination is structural

Language models predict the most likely next token, not the true one. When the training data is thin or absent, the model fills the gap with something plausible. A 2025 result showed this is mathematically inevitable for models built this way — hallucination can be reduced and detected, but not eliminated. Any vendor claiming "zero hallucinations" is overselling.
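To make the "most likely, not most true" point concrete, here is a minimal toy sketch. The probabilities are invented for illustration and do not come from any real model; `greedy_next_token` is a hypothetical helper, not part of any library.

```python
# Toy illustration: greedy decoding returns the most probable continuation,
# which is not necessarily the true one. All numbers below are invented.

def greedy_next_token(distribution: dict[str, float]) -> str:
    """Return the token with the highest probability."""
    return max(distribution, key=distribution.get)

# Hypothetical probabilities after the prompt "The capital of Australia is".
# Thin or skewed training data can let a plausible wrong answer outrank the right one.
next_token_probs = {"Sydney": 0.48, "Canberra": 0.41, "Melbourne": 0.11}

prediction = greedy_next_token(next_token_probs)
truth = "Canberra"
print(f"predicted={prediction!r} matches_truth={prediction == truth}")
```

Nothing in the selection step consults the truth; it only compares probabilities, which is why the gap gets filled with whatever is most plausible.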

It varies enormously by domain

A single model's hallucination rate is not one number — it depends entirely on the task.

| Domain | Typical hallucination rate | Implication |
|---|---|---|
| Summarisation | <1.5% | Safe: grounded in supplied text |
| General factual Q&A | 15–33% | Verify anything that matters |
| Medical case summaries | 43–64% | Human clinical review mandatory |
| Legal queries | 58–88% | Never rely on un-checked citations |

The lesson for business: a model that is safe for summarising your meeting notes can be dangerous for drafting a contract clause. Match the model — and the review process — to the risk.
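One way to turn "match the review process to the risk" into something operational is a small lookup, sketched below. The rates are the upper ends of the ranges in the domain table; the thresholds and policy names in `review_policy` are illustrative assumptions, not recommendations from the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRisk:
    domain: str
    max_rate: float  # upper end of the typical range in the table, as a fraction


DOMAIN_RISKS = [
    DomainRisk("summarisation", 0.015),
    DomainRisk("general_factual_qa", 0.33),
    DomainRisk("medical_case_summaries", 0.64),
    DomainRisk("legal_queries", 0.88),
]


def review_policy(max_rate: float) -> str:
    """Map a worst-case hallucination rate to a review step.

    The 5% and 40% cut-offs are assumptions chosen for illustration.
    """
    if max_rate < 0.05:
        return "spot-check"
    if max_rate < 0.40:
        return "verify key facts"
    return "mandatory expert review"


for risk in DOMAIN_RISKS:
    print(f"{risk.domain}: {review_policy(risk.max_rate)}")
```

Using the upper end of each range is a deliberately conservative choice: the policy is sized for the bad case rather than the average one.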

The Confident Liar problem

Not all wrong answers are equal. A model that says "I'm not certain, but…" is far safer than one that fabricates with total confidence. We weight confident-wrong answers more harshly in the Truth Score than acknowledged uncertainty. A model that knows the edge of its knowledge is a safer business tool than a more capable model that bluffs.
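The weighting idea can be sketched as follows. The actual Truth Score formula is not published here, so the penalty values and the `weighted_score` function are hypothetical; the sketch only shows how penalising confident errors more than hedged ones separates two models with the same raw accuracy.

```python
# Hypothetical penalty weights, chosen for illustration only.
# A confident wrong answer costs three times as much as a hedged wrong one.
PENALTY = {"correct": 0.0, "hedged_wrong": 1.0, "confident_wrong": 3.0}


def weighted_score(outcomes: list[str]) -> float:
    """Convert graded answers into a 0-100 score, where higher is better."""
    if not outcomes:
        raise ValueError("need at least one graded answer")
    worst_case = max(PENALTY.values()) * len(outcomes)
    total_penalty = sum(PENALTY[outcome] for outcome in outcomes)
    return round(100 * (1 - total_penalty / worst_case), 1)


# Both models get 8 of 10 answers right; they differ only in how they fail.
cautious = ["correct"] * 8 + ["hedged_wrong"] * 2
bluffer = ["correct"] * 8 + ["confident_wrong"] * 2

print(weighted_score(cautious), weighted_score(bluffer))
```

Under these assumed weights the cautious model scores higher than the bluffer despite identical accuracy, which is the behaviour the paragraph above describes.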

How to reduce hallucinations

You can't remove them, but you can cut them dramatically. Ranked by measured impact:

| Technique | Reduction | When to use |
|---|---|---|
| Retrieval-augmented generation (RAG) | ~71% | Any factual or document-grounded task |
| Self-consistency checking | ~65% | High-stakes single answers |
| Ensemble checking (multi-model) | 30–50% | Critical decisions worth the extra cost |
| Prompt mitigation ("say if unsure") | ~22pp | Cheap baseline on every prompt |
| Fine-tuning on domain data | Varies | Narrow, repeated, specialised tasks |
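Self-consistency checking can be sketched as asking the same question several times and accepting an answer only when enough samples agree. The sketch below assumes a caller-supplied `ask` function standing in for a model call; the sample count, the 60% agreement threshold, and the canned stub are all illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    ask: Callable[[str], str],
    question: str,
    samples: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Sample several answers and return the majority one, or None if agreement is weak."""
    answers = [ask(question).strip().lower() for _ in range(samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / samples >= min_agreement else None


# Stub standing in for a real model call (hypothetical canned replies).
canned = iter(["Paris", "paris ", "Lyon", "Paris", "Paris"])
result = self_consistent_answer(lambda q: next(canned), "Capital of France?")
print(result)
```

Returning `None` when the samples disagree gives the caller a clear signal to escalate to a human or a second model instead of passing along a shaky answer. Exact string matching is a simplification; free-form answers would need some normalisation or semantic comparison.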

The single most effective move is RAG: ground the model in your own verified sources so it retrieves rather than invents. Combined with a human review step on anything that matters, it turns an unreliable generalist into a dependable business tool.
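The grounding step can be illustrated with a deliberately simple sketch. It uses keyword overlap in place of a real embedding search, and the function names, prompt wording, and sample documents are all assumptions made for illustration; a production system would swap in a proper retriever and send the prompt to a model.

```python
import re
from typing import Optional


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question (a stand-in for vector search)."""
    query = tokenize(question)
    scored = [(len(query & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


def build_grounded_prompt(question: str, documents: list[str]) -> Optional[str]:
    """Build a prompt restricted to retrieved sources, or None if nothing relevant is found."""
    passages = retrieve(question, documents)
    if not passages:
        return None  # better to route to a human than let the model guess
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say you are unsure.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


docs = [
    "Refunds are processed within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 5 working days.",
]
print(build_grounded_prompt("How many days do refunds take?", docs))
```

The prompt also folds in the cheap "say if unsure" mitigation from the table, so the two techniques stack rather than compete.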

What changed in June 2026

Choosing a model where accuracy is critical? Use the match engine and weight safety highly, or compare every model's full scorecard on the comparison table.