Are AI companies honest about their models?

Partially. Vendor-reported benchmark numbers frequently differ from independent measurement, sometimes substantially. The AI Transparency Audit grades each provider on how closely its claims about hallucination, neutrality and privacy match independent evidence.

What is the AI Transparency Audit?

It is a public, versioned, quarterly report card from Best AI Match comparing each model's claimed characteristics against independently measured ones across accuracy, bias and privacy. It cites its sources and exists to hold providers accountable to their own marketing.

How often is the AI Transparency Audit updated?

Quarterly. Benchmark leaderboards and provider claims change constantly, so the audit is re-run each quarter and versioned, with the date shown on the page.

AI Transparency Audit 2026

Why this exists. Vendor-reported numbers routinely diverge from independent measurement, sometimes by double digits. No single place tracks claimed-versus-actual figures across accuracy, bias and privacy together. This audit does, with sources, so a buyer making a six-figure decision can see which providers are being straight with them.

The grades (Q2 2026)

Each provider is graded A–D on how well its public claims line up with independent evidence. The grade measures transparency, not model quality: a model can be excellent and still be over-marketed.

Provider	Accuracy claims	Neutrality claims	Privacy claims	Overall
Anthropic	A	B	A	A
Google	B	A	C	B
Microsoft	B	B	A	B
OpenAI	C	B	B	B
Perplexity	B	B	B	B
Meta	B	B	A	B
xAI	C	C	C	C
DeepSeek	C	C	D	D

Accuracy: claimed vs measured

The clearest transparency gap is in coding benchmarks, where vendor SWE-bench figures often differ from independent Scale SEAL numbers.

Model	Vendor SWE-bench	Independent (Scale SEAL)	Gap (pp)
Claude Fable 5	80.3%	95%	+14.7 (under-claimed)
Claude Opus 4.8	88.6%	86%	−2.6
GPT-5.5	87%	84%	−3.0
DeepSeek V3	79%	74%	−5.0

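The gap is the independent figure minus the vendor figure, in percentage points. Notably, the largest gap runs in the honest direction: Claude Fable 5 measures higher independently than its vendor claimed. Over-claiming, where the independent figure is lower than the vendor's, is the transparency concern.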
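The gap column can be reproduced with a few lines of Python. This is a minimal sketch rather than the audit's own tooling: the figures are copied from the table above, and the names SCORES, gap_points, direction and audit_rows are hypothetical, chosen only for illustration.

```python
# Figures copied from the table above: (vendor SWE-bench %, independent %).
SCORES = {
    "Claude Fable 5": (80.3, 95.0),
    "Claude Opus 4.8": (88.6, 86.0),
    "GPT-5.5": (87.0, 84.0),
    "DeepSeek V3": (79.0, 74.0),
}


def gap_points(vendor: float, independent: float) -> float:
    """Independent minus vendor, in percentage points (one decimal place)."""
    return round(independent - vendor, 1)


def direction(gap: float) -> str:
    """Label a gap: a negative gap means the vendor figure was higher."""
    if gap > 0:
        return "under-claimed"
    if gap < 0:
        return "over-claimed"
    return "matched"


def audit_rows(scores: dict) -> dict:
    """Map each model to its (gap, direction) pair."""
    rows = {}
    for model, (vendor, independent) in scores.items():
        gap = gap_points(vendor, independent)
        rows[model] = (gap, direction(gap))
    return rows


if __name__ == "__main__":
    for model, (gap, label) in audit_rows(SCORES).items():
        print(f"{model}: {gap:+.1f} pp ({label})")
```

Only the sign of the gap decides the label here. How large an over-claim has to be before it lowers a grade is an editorial judgment and is not modelled in this sketch.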

Neutrality: claimed vs measured

Most providers claim balance, but independent benchmarks find a consistent left-of-centre lean, with Gemini closest to neutral. The grade reflects how candidly each provider acknowledges its measured position rather than asserting a neutrality it doesn't have. See the Perspective Score for full detail.

Privacy: claimed vs documented

Grades reflect the clarity of each provider's data-handling commitments and its documented practice. Western enterprise providers (Anthropic, Microsoft, and Meta when self-hosted) score highest; providers with opaque or non-compliant data handling score lowest. See the privacy checklist for what to verify yourself.

Methodology

Accuracy: vendor benchmark claims vs independent leaderboards (Scale SEAL, Vectara, PersonQA). Larger over-claim = lower grade.
Neutrality: stated balance vs measured lean (Promptfoo, IEEE/TechRxiv, Stanford). Candour about measured position raises the grade.
Privacy: documented compliance (GDPR/HIPAA/SOC2), residency clarity and training-opt-out transparency.

Grades are editorial judgments based on public evidence, not first-hand testing. This is v1.0, Q2 2026. The next revision is due Q3 2026.

This is the report nobody else publishes. If you represent a provider and believe a grade misreads the evidence, contact us via the about page with sources and we will review it openly in the next version. Related: the Truth Score and Perspective Score.