Which AI makes things up the most?

Among current models, reasoning-focused and open-weight models hallucinate most. OpenAI's o3 hallucinated around 33% of the time on the PersonQA benchmark, and o4-mini was higher still at 48%. Open-weight models can exceed 30%. Claude Sonnet 4.6 has the lowest rate among frontier models, at roughly 3% on factual benchmarks.

What causes AI hallucinations?

Large language models predict the most likely next token, not the truth. When training data is thin, ambiguous or absent, the model fills the gap with a plausible-sounding fabrication. A 2025 result showed hallucination is mathematically inevitable for these models, so it can be reduced but not fully eliminated.

How do you stop AI hallucinations?

You cannot eliminate them, but you can cut them sharply. Retrieval-augmented generation (RAG) reduces hallucinations by around 71%, self-consistency checking by around 65%, prompt mitigation by around 22 percentage points, and ensemble checking by 30-50%. Keeping a human in the loop on anything important is essential.

Can AI hallucinations be eliminated?

No. A 2025 mathematical result showed that hallucination is a structural property of how language models generate text, not a bug to be patched. The realistic goal is to minimise and detect hallucinations, not to remove them entirely.

What is an AI hallucination?

An AI hallucination is when a model produces information that is fluent and confident but factually wrong or entirely fabricated, such as a fake citation, invented statistic or non-existent product feature. It happens because the model optimises for plausible text, not verified fact.

AI Truth Score — Which AI Hallucinates Most

In one line: hallucination — when an AI states something fluent, confident and wrong — is structural, not a bug. Across 37 models in 2026, measured rates ran from 15% to 52%. The Truth Score is our editorial 0–100 rating of how reliably each model sticks to the truth, and it feeds the Safety & Reliability factor in our scoring.

The Truth Score table

Higher is better — a 96 means very few hallucinations; a 62 means frequent fabrication. Scores are editorial, grounded in published hallucination research and leaderboards.

Model	Truth Score	Hallucination tendency
Claude Sonnet 4.6	96	Lowest — ~3% on factual benchmarks
Claude Fable 5	93	Very low, slight reasoning trade-off
Claude Opus 4.8	92	Very low
Claude Haiku 4.5	90	Low
Perplexity Pro	90	Low — retrieval-grounded with citations
Gemini 3.1 Pro	84	Low-moderate
Microsoft Copilot	83	Low-moderate
GPT-5.4	82	Moderate
GPT-4o	80	Moderate — ~16% on PersonQA
Gemini 3 Flash	80	Moderate
Grok 4.1	79	Moderate — ~12%, highest among top commercial
GPT-5.5	78	Moderate (reasoning trade-off)
Gemini 3.1 Flash-Lite	74	Higher
Llama 4	74	Higher (open-weight)
DeepSeek V3	68	Higher (open-weight)
o3	62	Highest — ~33% on PersonQA

Sources: Vectara hallucination leaderboard, provider research disclosures, and published benchmark studies (PersonQA, TruthfulQA). This is an editorial synthesis of published figures, not first-hand testing.

The confidence paradox: smarter isn't more honest

The most counter-intuitive finding of 2026: reasoning models hallucinate more on factual questions. OpenAI's o3 hallucinated around 33% of the time on PersonQA, more than double the ~16% of its predecessor o1. The smaller o4-mini was worse again at 48%. More deliberate reasoning can produce answers that are more elaborate, more confident and sometimes more wrong. A model that reasons its way to a fabrication is more dangerous than one that simply admits it doesn't know.

Why hallucination is structural

Language models predict the most likely next token, not the true one. When the training data is thin or absent, the model fills the gap with something plausible. A 2025 result showed this is mathematically inevitable for models built this way — hallucination can be reduced and detected, but not eliminated. Any vendor claiming "zero hallucinations" is overselling.

It varies enormously by domain

A single model's hallucination rate is not one number; it depends heavily on the task.

Domain	Typical hallucination rate	Implication
Summarisation	<1.5%	Safe — grounded in supplied text
General factual Q&A	15–33%	Verify anything that matters
Medical case summaries	43–64%	Human clinical review mandatory
Legal queries	58–88%	Never rely on un-checked citations

The lesson for business: a model that is safe for summarising your meeting notes can be dangerous for drafting a contract clause. Match the model — and the review process — to the risk.
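
As a rough illustration of matching the review process to the risk, the sketch below maps each domain in the table above to a review level. The rates are the upper ends of the table's ranges; the two cut-off values and the policy names are hypothetical, chosen only to make the example concrete.

```python
# Upper end of each "typical hallucination rate" range from the table above,
# in percent. The "<1.5%" summarisation figure is taken as 1.5.
DOMAIN_MAX_RATE = {
    "summarisation": 1.5,
    "general_factual_qa": 33.0,
    "medical_case_summaries": 64.0,
    "legal_queries": 88.0,
}

# Hypothetical cut-offs, for illustration only.
VERIFY_ABOVE = 5.0
MANDATORY_REVIEW_ABOVE = 40.0


def review_policy(domain: str) -> str:
    """Return a review level for a task domain.

    Unknown domains fall through to the strictest level, on the
    assumption that an unmeasured risk should be treated as a high one.
    """
    rate = DOMAIN_MAX_RATE.get(domain)
    if rate is None or rate > MANDATORY_REVIEW_ABOVE:
        return "mandatory_human_review"
    if rate > VERIFY_ABOVE:
        return "verify_key_facts"
    return "spot_check"


if __name__ == "__main__":
    for name in DOMAIN_MAX_RATE:
        print(name, "->", review_policy(name))
```

Defaulting unknown domains to the strictest level is a deliberate choice here: a task nobody has measured is not a task that is known to be safe.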

The Confident Liar problem

Not all wrong answers are equal. A model that says "I'm not certain, but…" is far safer than one that fabricates with total confidence. We weight confident-wrong answers more harshly in the Truth Score than acknowledged uncertainty. A model that knows the edge of its knowledge is a safer business tool than a more capable model that bluffs.
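
To make the idea of weighting confident-wrong answers more harshly concrete, here is a minimal sketch. It is not the actual Truth Score formula, which is editorial; the label names and penalty weights below are assumptions made purely for illustration.

```python
# Hypothetical penalties per graded answer. A wrong answer that flagged its
# own uncertainty costs less than a wrong answer stated with full confidence.
PENALTY = {
    "correct": 0.0,
    "hedged_wrong": 0.4,
    "confident_wrong": 1.0,
}


def weighted_score(labels: list[str]) -> float:
    """Return a 0-100 score from a list of graded answer labels."""
    if not labels:
        raise ValueError("need at least one graded answer")
    total_penalty = sum(PENALTY[label] for label in labels)
    return round(100 * (1 - total_penalty / len(labels)), 1)


if __name__ == "__main__":
    # Two models, each wrong on 2 of 10 answers, differ only in how they fail.
    cautious = ["correct"] * 8 + ["hedged_wrong"] * 2
    bluffing = ["correct"] * 8 + ["confident_wrong"] * 2
    print(weighted_score(cautious), weighted_score(bluffing))
```

Under these assumed weights, two models with the same raw error rate score differently: the one that hedges its errors ends up ahead of the one that bluffs.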

How to reduce hallucinations

You can't remove them, but you can cut them dramatically. Ranked by measured impact:

Technique	Reduction	When to use
Retrieval-augmented generation (RAG)	~71%	Any factual or document-grounded task
Self-consistency checking	~65%	High-stakes single answers
Ensemble checking (multi-model)	30–50%	Critical decisions worth the extra cost
Prompt mitigation ("say if unsure")	~22pp	Cheap baseline on every prompt
Fine-tuning on domain data	Varies	Narrow, repeated, specialised tasks
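
The self-consistency row above can be sketched as a simple majority vote: ask the same question several times and accept the answer only if enough samples agree. In this sketch `ask` is a hypothetical stand-in for whatever function calls your model, and the sample count and agreement threshold are assumed values, not recommendations from the research cited.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    ask: Callable[[str], str],   # hypothetical: sends a prompt, returns the reply
    question: str,
    samples: int = 5,            # assumed sample count
    min_agreement: float = 0.6,  # assumed share of samples that must agree
) -> Optional[str]:
    """Return the majority answer, or None if the samples disagree too much."""
    # Normalise lightly so "Paris" and "paris " count as the same answer.
    answers = [ask(question).strip().lower() for _ in range(samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / samples >= min_agreement:
        return top_answer
    return None  # no consensus: escalate to a human or another check


if __name__ == "__main__":
    canned = iter(["Paris", "paris ", "Paris", "Lyon", "Paris"])
    print(self_consistent_answer(lambda q: next(canned), "Capital of France?"))
```

Exact string matching only suits short factual answers. Longer free-text replies would need a looser comparison, which this sketch does not attempt.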

The single most effective move is RAG: ground the model in your own verified sources so it retrieves rather than invents. Combined with a human review step on anything that matters, it turns an unreliable generalist into a dependable business tool.
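
A minimal sketch of the grounding step, assuming a small in-memory list of verified passages and a naive word-overlap retriever. Real RAG systems typically use embedding search over a document store. The function names and prompt wording here are illustrative, not any particular library's API. The prompt also folds in the "say if unsure" mitigation from the table.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-case word set used for naive overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k passages sharing at least one word with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokens(d)]


def grounded_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Build a prompt that restricts the model to the retrieved passages."""
    passages = retrieve(question, documents, k)
    if passages:
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    else:
        context = "(no relevant passages found)"
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say you are unsure.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    docs = [
        "The refund window is 30 days.",
        "Our office is in Leeds.",
        "Shipping takes 5 days.",
    ]
    print(grounded_prompt("What is the refund window?", docs, k=1))
```

The resulting string would be sent to the model in place of the bare question, and the numbered sources give a human reviewer something concrete to check the answer against.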

What changed in June 2026

The 37-model benchmark put measured hallucination at 15–52%, confirming the problem is universal, not vendor-specific.
Reasoning models (o3, o4-mini) were shown to hallucinate more on facts — reframing the debate from "can we fix it" to "which model lies least".
Claude Sonnet 4.6 held the lowest measured rate among frontier models at roughly 3%.

Choosing a model where accuracy is critical? Use the match engine and weight safety highly, or compare every model's full scorecard on the comparison table.
