Quarterly report card · Q2 2026 (v1.0)

AI Transparency Audit 2026

A public report card on a simple question: are AI providers honest about their own models? We grade each on how closely its claims about accuracy, neutrality and privacy match independent evidence. Versioned and updated quarterly.

Why this exists. Vendor-reported numbers routinely diverge from independent measurement — sometimes by double digits. No one place tracks claimed-versus-actual across accuracy, bias and privacy together. This audit does, with sources, so a buyer making a six-figure decision can see who is straight with them.

The grades (Q2 2026)

Each provider graded A–D on how well its public claims line up with independent evidence. This grades transparency, not model quality — a model can be excellent and still be over-marketed.

Provider     Accuracy claims   Neutrality claims   Privacy claims   Overall
Anthropic    A                 B                   A                A
Google       B                 A                   C                B
Microsoft    B                 B                   A                B
OpenAI       C                 B                   B                B
Perplexity   B                 B                   B                B
Meta         B                 B                   A                B
xAI          C                 C                   C                C
DeepSeek     C                 C                   D                D

Accuracy: claimed vs measured

The clearest transparency gap is in coding benchmarks, where vendor SWE-bench figures often differ from independent Scale SEAL numbers.

Model             Vendor SWE-bench   Independent   Gap
Claude Fable 5    80.3%              95%           +14.7 (under-claimed)
Claude Opus 4.8   88.6%              86%           −2.6
GPT-5.5           87%                84%           −3.0
DeepSeek V3       79%                74%           −5.0

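The gap is the independent score minus the vendor's figure, in percentage points. Notably, the largest gap runs in the honest direction: Claude Fable 5 scores 14.7 points higher in independent testing than its vendor claimed. Over-claiming, where the independent score is lower, is the transparency concern; the other three models are over-claimed by 2.6 to 5.0 points.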
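The gap arithmetic can be reproduced with a short Python sketch. The figures are copied from the table in this section; the `BenchmarkClaim` class, its field names and the direction labels are illustrative assumptions, not part of the audit's own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkClaim:
    """One row of the claimed-vs-measured table (hypothetical structure)."""
    model: str
    vendor_pct: float       # vendor-reported SWE-bench score, in percent
    independent_pct: float  # independently measured score, in percent

    @property
    def gap(self) -> float:
        # Gap in percentage points: independent minus vendor-claimed.
        # Rounded to one decimal to avoid floating-point noise.
        return round(self.independent_pct - self.vendor_pct, 1)

    @property
    def direction(self) -> str:
        # Positive gap: the vendor claimed less than was measured.
        if self.gap > 0:
            return "under-claimed"
        if self.gap < 0:
            return "over-claimed"
        return "matched"


# Figures taken from the table in this section.
CLAIMS = [
    BenchmarkClaim("Claude Fable 5", 80.3, 95.0),
    BenchmarkClaim("Claude Opus 4.8", 88.6, 86.0),
    BenchmarkClaim("GPT-5.5", 87.0, 84.0),
    BenchmarkClaim("DeepSeek V3", 79.0, 74.0),
]

for claim in CLAIMS:
    print(f"{claim.model}: {claim.gap:+.1f} ({claim.direction})")
```

Defining the sign as independent minus vendor keeps over-claiming negative, which matches how the table reports it.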

Neutrality: claimed vs measured

Most providers claim balance; independent benchmarks find a consistent left-of-centre lean (Gemini is closest to neutral). The grade reflects how candidly each provider acknowledges its measured position rather than asserting neutrality it doesn't have. See the Perspective Score for full detail.

Privacy: claimed vs documented

Grades reflect the clarity of each provider's data-handling commitments and its documented practice. Western enterprise providers (Anthropic, Microsoft, and Meta when self-hosted) score highest; providers with opaque or non-compliant data handling score lowest. See the privacy checklist for what to verify yourself.

Methodology

Grades are editorial judgments based on public evidence, not first-hand testing. This is v1.0, Q2 2026. The next revision is due Q3 2026.

This is the report nobody else publishes. If you represent a provider and believe a grade misreads the evidence, contact us via the about page with sources and we will review it openly in the next version. Related: the Truth Score and Perspective Score.