In one line: a model's overall score is a weighted average of task performance, cost efficiency, context window, speed, safety, privacy, integration and adoption ease. Weights are locked at v1.0 and version-controlled. Every individual score links to its source.
The 8 scoring factors
| Factor | Weight | What it measures |
|---|---|---|
| Task Performance | 25% | Benchmark scores by task: coding (SWE-bench), reasoning (ARC-AGI-2), writing, analysis, multimodal |
| Cost Efficiency | 20% | Input/output token price, context caching discounts, agentic multiplier (5–20x), volume economics |
| Context Window | 15% | Token context length — critical for document processing, long workflows and large codebases |
| Speed / Latency | 10% | Tokens per second, time-to-first-token — critical for live interactions and real-time agents |
| Safety & Reliability | 10% | Hallucination rates by task, refusal consistency, uptime, context faithfulness |
| Data Privacy | 10% | Training opt-out, enterprise DPA, GDPR/HIPAA/SOC2 compliance, data residency |
| Integration Quality | 5% | API quality, SDK availability, MCP support, rate limits, enterprise SLA |
| Adoption Ease | 5% | No-code access, fine-tuning availability, documentation quality, community size |
The overall weighted score reflects all-round business value. It is not a verdict on which model is "best" — the best model for a specific task is often different, which is exactly why the match engine and per-task pages exist.
What our scores are — and are not
They are an editorial synthesis of published evidence: provider pricing pages, official model documentation, public benchmark leaderboards (SWE-bench, ARC-AGI-2, Scale SEAL, Artificial Analysis) and independent test reports.
They are not first-person lab tests. We never claim to have personally benchmarked a model's speed or run our own evaluations. Where a vendor-reported number differs materially from an independent one, we show both and flag the gap.
Data sources
- Provider pricing and documentation pages (Anthropic, OpenAI, Google, DeepSeek, Meta, xAI, Microsoft, Perplexity)
- SWE-bench Verified — swebench.com
- Scale SEAL leaderboard — scale.com/leaderboard
- ARC Prize / ARC-AGI — arcprize.org
- Artificial Analysis — artificialanalysis.ai
Update cadence
AI pricing has dropped roughly 80% in the past year and benchmark leaderboards change monthly. This is the product: a stale comparison site loses authority fast in this market.
- Monthly: token pricing, data-verified date, new flagship model scoring within two weeks of general availability.
- Quarterly: full re-score of all models, schema validation, cross-link audit.
Scoring weight version log
| Version | Date | Change |
|---|---|---|
| v1.0 | June 2026 | Initial weighting across 8 factors. Locked. |
Weights will not change without incrementing the version here and recording the rationale on the ethics page.