Which AI hallucinates the most?

Among the models ranked here, OpenAI's o3 hallucinates most, at around 33% on the PersonQA benchmark. OpenAI's o4-mini, which is not in the ranking, scores higher still at about 48%. Open-weight models such as DeepSeek V3 also rank high. Claude Sonnet 4.6 hallucinates least, at roughly 3%.

Which AI is most accurate?

For factual reliability, Claude Sonnet 4.6 leads with the lowest measured hallucination rate, followed by other Claude models and retrieval-grounded Perplexity Pro. Accuracy also depends heavily on the task and on whether retrieval (RAG) is used.

Which AI Hallucinates Most?

Quick answer. Highest hallucination: OpenAI's o4-mini (~48%) and o3 (~33%) on PersonQA, then open-weight models like DeepSeek V3. Lowest: Claude Sonnet 4.6 (~3%), the rest of the Claude family, and retrieval-grounded Perplexity Pro. See the full Truth Score for the scoring method.

Ranked: most to least hallucination

Rank	Model	Truth Score	Tendency
1 (worst)	o3	62	~33% on PersonQA
2	DeepSeek V3	68	High (open-weight)
3	Llama 4	74	Higher
3	Gemini 3.1 Flash-Lite	74	Higher
5	GPT-5.5	78	Moderate (reasoning)
6	Grok 4.1	79	~12%, highest of top commercial
7	GPT-4o / Gemini 3 Flash	80	Moderate (~16%)
9	GPT-5.4	82	Moderate
10	Microsoft Copilot	83	Low-moderate
11	Gemini 3.1 Pro	84	Low-moderate
12	Claude Haiku 4.5 / Perplexity Pro	90	Low
14	Claude Opus 4.8	92	Very low
15	Claude Fable 5	93	Very low
16 (best)	Claude Sonnet 4.6	96	Lowest (~3%)

Sources: Vectara hallucination leaderboard, PersonQA results, provider disclosures. Editorial synthesis.
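
The rank column skips 4, 8 and 13 because tied Truth Scores share a rank and the following rank is dropped. The sketch below reproduces that numbering from the scores in the table. It is illustrative only: the function name `competition_rank` is hypothetical, and the tied rows ("GPT-4o / Gemini 3 Flash", "Claude Haiku 4.5 / Perplexity Pro") are split into separate entries as an assumption about how the table was built.

```python
# Truth Scores copied from the table above; a higher score means fewer hallucinations.
TRUTH_SCORES = {
    "o3": 62,
    "DeepSeek V3": 68,
    "Llama 4": 74,
    "Gemini 3.1 Flash-Lite": 74,
    "GPT-5.5": 78,
    "Grok 4.1": 79,
    "GPT-4o": 80,
    "Gemini 3 Flash": 80,
    "GPT-5.4": 82,
    "Microsoft Copilot": 83,
    "Gemini 3.1 Pro": 84,
    "Claude Haiku 4.5": 90,
    "Perplexity Pro": 90,
    "Claude Opus 4.8": 92,
    "Claude Fable 5": 93,
    "Claude Sonnet 4.6": 96,
}


def competition_rank(scores):
    """Rank models worst-first; tied scores share a rank and the next rank is skipped."""
    # Sort ascending by score so the lowest Truth Score comes first (rank 1 = worst).
    ordered = sorted(scores.items(), key=lambda item: item[1])
    ranks = {}
    for position, (model, score) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and previous[1] == score:
            # Same score as the previous model: reuse its rank.
            ranks[model] = ranks[previous[0]]
        else:
            # Otherwise the rank is the 1-based position in the sorted list.
            ranks[model] = position
    return ranks


ranks = competition_rank(TRUTH_SCORES)
for model, rank in ranks.items():
    print(rank, model, TRUTH_SCORES[model])
```

Sorting ascending keeps the table's "1 (worst)" convention; reversing the sort would give a best-first ranking instead.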

Why reasoning models are worse

OpenAI's o3 hallucinated about twice as often as its predecessor o1 on the same PersonQA benchmark (roughly 33% versus 16%). More elaborate reasoning tends to produce more confident, more detailed, and sometimes more wrong answers. Read the full explanation of the confidence paradox.

How to pick a reliable model

For accuracy-critical work, favour models with a high Truth Score and add retrieval. See how to reduce hallucinations, and for risk by sector see hallucination by industry.
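
One way to apply this advice is to shortlist models above a minimum Truth Score before layering retrieval on top. The sketch below is a rough illustration, not a recommended cutoff: the function name `shortlist` and the threshold of 90 are assumptions, and the scores are a subset copied from the ranking table.

```python
# Subset of Truth Scores from the ranking table (higher = fewer hallucinations).
TRUTH_SCORES = {
    "o3": 62,
    "GPT-5.4": 82,
    "Gemini 3.1 Pro": 84,
    "Claude Haiku 4.5": 90,
    "Perplexity Pro": 90,
    "Claude Opus 4.8": 92,
    "Claude Fable 5": 93,
    "Claude Sonnet 4.6": 96,
}


def shortlist(scores, min_score):
    """Return models at or above min_score, best Truth Score first."""
    candidates = [model for model, score in scores.items() if score >= min_score]
    # Sort descending by score; ties keep their original order.
    return sorted(candidates, key=lambda model: scores[model], reverse=True)


# The threshold of 90 is an assumed example value, not one taken from the article.
reliable = shortlist(TRUTH_SCORES, min_score=90)
print(reliable)
```

The threshold should be set per use case; a stricter cutoff narrows the list, and retrieval is still worth adding for whichever model is chosen.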

Does accuracy matter for your use case? Weight safety in the match engine or compare every model on the table.
