Why this exists
Vendor-reported numbers routinely diverge from independent measurement, sometimes by double-digit percentage points. No single source tracks claimed-versus-actual performance across accuracy, bias and privacy together. This audit does, with sources, so a buyer making a six-figure decision can see who is straight with them.
The grades (Q2 2026)
Each provider is graded A–D on how well its public claims line up with independent evidence. The grade measures transparency, not model quality: a model can be excellent and still be over-marketed.
| Provider | Accuracy claims | Neutrality claims | Privacy claims | Overall |
|---|---|---|---|---|
| Anthropic | A | B | A | A |
| | B | A | C | B |
| Microsoft | B | B | A | B |
| OpenAI | C | B | B | B |
| Perplexity | B | B | B | B |
| Meta | B | B | A | B |
| xAI | C | C | C | C |
| DeepSeek | C | C | D | D |
Accuracy: claimed vs measured
The clearest transparency gap is in coding benchmarks, where vendor SWE-bench figures often differ from independent Scale SEAL numbers.
| Model | Vendor SWE-bench | Independent (Scale SEAL) | Gap (independent − vendor, points) |
|---|---|---|---|
| Claude Fable 5 | 80.3% | 95% | +14.7 (under-claimed) |
| Claude Opus 4.8 | 88.6% | 86% | −2.6 |
| GPT-5.5 | 87% | 84% | −3.0 |
| DeepSeek V3 | 79% | 74% | −5.0 |
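Notably, the largest gap runs in the honest direction: Claude Fable 5 measures 14.7 points higher independently than its vendor claimed. Over-claiming, where the independent figure is lower than the vendor's, is the transparency concern, and it appears in the other three rows at 2.6 to 5.0 points.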
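The gap column is plain subtraction, and a short sketch makes the sign convention explicit. The figures below are copied from the table above; the function names (`gap_pp`, `direction`, `gap_table`) are illustrative assumptions, not part of the audit's tooling, and the sketch assigns no grades because those remain editorial judgments.

```python
# Figures copied from the table above: (model, vendor SWE-bench %, independent %).
ROWS = [
    ("Claude Fable 5", 80.3, 95.0),
    ("Claude Opus 4.8", 88.6, 86.0),
    ("GPT-5.5", 87.0, 84.0),
    ("DeepSeek V3", 79.0, 74.0),
]


def gap_pp(vendor: float, independent: float) -> float:
    """Gap in percentage points: independent minus vendor, rounded to one decimal."""
    return round(independent - vendor, 1)


def direction(gap: float) -> str:
    """Label a gap. Negative means the vendor figure exceeded the independent one."""
    if gap > 0:
        return "under-claimed"
    if gap < 0:
        return "over-claimed"
    return "matched"


def gap_table(rows):
    """Return (model, gap, label) for each row, using the hypothetical helpers above."""
    result = []
    for model, vendor, independent in rows:
        gap = gap_pp(vendor, independent)
        result.append((model, gap, direction(gap)))
    return result


if __name__ == "__main__":
    for model, gap, label in gap_table(ROWS):
        print(f"{model}: {gap:+.1f} points ({label})")
```

A positive gap means the independent score is higher than the vendor's claim; a negative gap is the over-claiming case the grades penalise.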
Neutrality: claimed vs measured
Most providers claim balance; independent benchmarks find a consistent left-of-centre lean, with Gemini closest to neutral. The grade reflects how candidly each provider acknowledges its measured position rather than asserting a neutrality it doesn't have. Full detail is in the Perspective Score.
Privacy: claimed vs documented
Grades reflect the clarity of data-handling commitments and documented practice. Western enterprise providers (Anthropic, Microsoft, and Meta when self-hosted) score highest; providers with opaque or non-compliant data handling score lowest. See the privacy checklist for what to verify yourself.
Methodology
- Accuracy: vendor benchmark claims vs independent leaderboards (Scale SEAL, Vectara, PersonQA). A larger over-claim means a lower grade.
- Neutrality: stated balance vs measured lean (Promptfoo, IEEE/TechRxiv, Stanford). Candour about the measured position raises the grade.
- Privacy: documented compliance (GDPR/HIPAA/SOC 2), data-residency clarity and training-opt-out transparency.
Grades are editorial judgments based on public evidence, not first-hand testing. This is v1.0, Q2 2026. The next revision is due Q3 2026.
This is the report nobody else publishes. If you represent a provider and believe a grade misreads the evidence, contact us via the about page with sources and we will review it openly in the next version. Related: the Truth Score and Perspective Score.