Quick answer. The highest hallucination rates belong to OpenAI's reasoning models o3 (~33%) and o4-mini (~48%), both measured on PersonQA, followed by open-weight models such as DeepSeek V3. The lowest belong to Claude Sonnet 4.6 (~3%), the rest of the Claude family, and retrieval-grounded Perplexity Pro. See the full Truth Score for the scoring method.
Ranked: most to least hallucination
| Rank | Model | Truth Score | Hallucination tendency |
|---|---|---|---|
| 1 (worst) | o3 | 62 | ~33% on PersonQA |
| 2 | DeepSeek V3 | 68 | High (open-weight) |
| 3 | Llama 4 | 74 | Higher |
| 3 | Gemini 3.1 Flash-Lite | 74 | Higher |
| 5 | GPT-5.5 | 78 | Moderate (reasoning) |
| 6 | Grok 4.1 | 79 | ~12%, highest among top commercial models |
| 7 | GPT-4o / Gemini 3 Flash | 80 | Moderate (~16%) |
| 9 | GPT-5.4 | 82 | Moderate |
| 10 | Microsoft Copilot | 83 | Low-moderate |
| 11 | Gemini 3.1 Pro | 84 | Low-moderate |
| 12 | Claude Haiku 4.5 / Perplexity Pro | 90 | Low |
| 14 | Claude Opus 4.8 | 92 | Very low |
| 15 | Claude Fable 5 | 93 | Very low |
| 16 (best) | Claude Sonnet 4.6 | 96 | Lowest (~3%) |
Sources: Vectara hallucination leaderboard, PersonQA results, and provider disclosures, combined by editorial synthesis.
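The rank column skips 4, 8 and 13 because tied models share a rank and the next rank is then skipped. A minimal Python sketch of that tie handling is below, using the Truth Scores from the table with two-model rows split into separate entries. The function name `competition_ranks` is illustrative only, and this is an assumption about how the ranks were assigned rather than the published scoring method.

```python
# Truth Scores copied from the ranking table; rows naming two models are split.
TRUTH_SCORES = {
    "o3": 62,
    "DeepSeek V3": 68,
    "Llama 4": 74,
    "Gemini 3.1 Flash-Lite": 74,
    "GPT-5.5": 78,
    "Grok 4.1": 79,
    "GPT-4o": 80,
    "Gemini 3 Flash": 80,
    "GPT-5.4": 82,
    "Microsoft Copilot": 83,
    "Gemini 3.1 Pro": 84,
    "Claude Haiku 4.5": 90,
    "Perplexity Pro": 90,
    "Claude Opus 4.8": 92,
    "Claude Fable 5": 93,
    "Claude Sonnet 4.6": 96,
}


def competition_ranks(scores):
    """Rank from the lowest score (worst) upward.

    Tied scores share a rank, and the rank after a tie is skipped
    (hypothetical helper, not part of any published tooling).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1])
    ranks = {}
    previous_score = None
    current_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        if score != previous_score:
            # A new score value takes its 1-based position as its rank.
            current_rank = position
            previous_score = score
        ranks[model] = current_rank
    return ranks


ranks = competition_ranks(TRUTH_SCORES)
for model, rank in ranks.items():
    print(f"{rank:>2}  {model:<22} {TRUTH_SCORES[model]}")
```

Under these assumptions the two models scoring 74 both land at rank 3, so the next model is ranked 5, which matches the table.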
Why reasoning models are worse
On PersonQA, OpenAI's o3 hallucinated about 33% of the time, roughly double the 16% of its predecessor o1. More elaborate reasoning produces more confident and more detailed answers, and sometimes more wrong ones. Read the full explanation of the confidence paradox.
How to pick a reliable model
For accuracy-critical work, favour models with a high Truth Score and ground them with retrieval. See how to reduce hallucinations, and hallucination by industry for risk by sector.
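As a rough illustration of that selection rule, the Python sketch below shortlists models by Truth Score and flags which of them would still need a retrieval layer. The scores come from the ranking table. The `Model` class, the `shortlist` helper and the cutoff of 90 are assumptions made for the example, not part of the Truth Score method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    truth_score: int
    retrieval_grounded: bool = False


# A subset of the ranking table. Only Perplexity Pro is described in the
# article as retrieval-grounded; the rest default to False.
CANDIDATES = [
    Model("o3", 62),
    Model("GPT-5.4", 82),
    Model("Gemini 3.1 Pro", 84),
    Model("Perplexity Pro", 90, retrieval_grounded=True),
    Model("Claude Opus 4.8", 92),
    Model("Claude Sonnet 4.6", 96),
]


def shortlist(models, min_score=90):
    """Keep models at or above min_score, best first.

    min_score=90 is an illustrative cutoff, not a recommended threshold.
    """
    kept = [m for m in models if m.truth_score >= min_score]
    return sorted(kept, key=lambda m: m.truth_score, reverse=True)


picks = shortlist(CANDIDATES)
needs_retrieval = [m.name for m in picks if not m.retrieval_grounded]

for m in picks:
    note = "retrieval built in" if m.retrieval_grounded else "add retrieval"
    print(f"{m.name:<20} {m.truth_score}  ({note})")
```

Raising or lowering `min_score` trades a shorter, safer list against more choice. Models that are not retrieval-grounded stay on the list but are marked as needing grounding added.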
Accuracy matters for your use case? Weight safety in the match engine or compare every model on the table.