Truth Score series · Updated June 2026

Which AI hallucinates most?

Among the models ranked below, o3 hallucinates most, at around 33% on PersonQA. Claude Sonnet 4.6 hallucinates least, at roughly 3%. The surprise: the worst offenders are reasoning models, not the cheapest ones.

Quick answer. Highest hallucination: OpenAI's reasoning models o4-mini (~48%, not included in the ranking below) and o3 (~33%), followed by open-weight models such as DeepSeek V3. Lowest: Claude Sonnet 4.6 (~3%), the rest of the Claude family, and retrieval-grounded Perplexity Pro. See the full Truth Score for the scoring method.

Ranked: most to least hallucination

| Rank | Model | Truth Score | Tendency |
|---|---|---|---|
| 1 (worst) | o3 | 62 | ~33% on PersonQA |
| 2 | DeepSeek V3 | 68 | High (open-weight) |
| 3 | Llama 4 | 74 | Higher |
| 3 | Gemini 3.1 Flash-Lite | 74 | Higher |
| 5 | GPT-5.5 | 78 | Moderate (reasoning) |
| 6 | Grok 4.1 | 79 | ~12%, highest of top commercial |
| 7 | GPT-4o / Gemini 3 Flash | 80 | Moderate (~16%) |
| 9 | GPT-5.4 | 82 | Moderate |
| 10 | Microsoft Copilot | 83 | Low-moderate |
| 11 | Gemini 3.1 Pro | 84 | Low-moderate |
| 12 | Claude Haiku 4.5 / Perplexity Pro | 90 | Low |
| 14 | Claude Opus 4.8 | 92 | Very low |
| 15 | Claude Fable 5 | 93 | Very low |
| 16 (best) | Claude Sonnet 4.6 | 96 | Lowest (~3%) |

Sources: Vectara hallucination leaderboard, PersonQA results, provider disclosures. Editorial synthesis.

Why reasoning models are worse

OpenAI's o3 hallucinated at slightly more than twice the rate of its predecessor o1 (about 33% versus 16%) on the same benchmark, PersonQA. More elaborate reasoning produces answers that are more confident and more detailed, and sometimes more wrong. Read the full explanation of the confidence paradox.

How to pick a reliable model

For accuracy-critical work, favour models with a high Truth Score and add retrieval. See how to reduce hallucinations, and see hallucination by industry for risk by sector.
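If you want to apply that rule of thumb programmatically, the sketch below shortlists models from the ranking table by Truth Score. The scores are copied from the table above; the function name `shortlist` and the cut-off of 90 are illustrative assumptions, not part of the Truth Score method, so treat the threshold as something to tune for your own risk tolerance.

```python
# Truth Scores copied from the ranking table (higher = fewer hallucinations).
TRUTH_SCORES = {
    "o3": 62,
    "DeepSeek V3": 68,
    "Llama 4": 74,
    "Gemini 3.1 Flash-Lite": 74,
    "GPT-5.5": 78,
    "Grok 4.1": 79,
    "GPT-4o": 80,
    "Gemini 3 Flash": 80,
    "GPT-5.4": 82,
    "Microsoft Copilot": 83,
    "Gemini 3.1 Pro": 84,
    "Claude Haiku 4.5": 90,
    "Perplexity Pro": 90,
    "Claude Opus 4.8": 92,
    "Claude Fable 5": 93,
    "Claude Sonnet 4.6": 96,
}


def shortlist(scores, min_score=90):
    """Return (model, score) pairs at or above min_score, best first.

    min_score=90 is a hypothetical cut-off chosen for illustration only.
    """
    keep = [(model, score) for model, score in scores.items() if score >= min_score]
    # sorted() is stable, so tied models keep their table order.
    return sorted(keep, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in shortlist(TRUTH_SCORES):
        print(f"{score:>3}  {model}")
```

A shortlist like this only narrows the field. Pairing the chosen model with retrieval, as suggested above, is a separate step that the sketch does not cover.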

Accuracy matters for your use case? Weight safety in the match engine or compare every model on the table.