In short, a model's overall score is a weighted average of eight factors: task performance, cost efficiency, context window, speed, safety, privacy, integration quality and adoption ease. Weights are locked at v1.0 and version-controlled. Every individual score links to its source.

The 8 scoring factors

Factor	Weight	What it measures
Task Performance	25%	Benchmark scores by task: coding (SWE-bench), reasoning (ARC-AGI-2), writing, analysis, multimodal
Cost Efficiency	20%	Input/output token price, context caching discounts, agentic multiplier (5–20x), volume economics
Context Window	15%	Token context length — critical for document processing, long workflows and large codebases
Speed / Latency	10%	Tokens per second, time-to-first-token — critical for live interactions and real-time agents
Safety & Reliability	10%	Hallucination rates by task, refusal consistency, uptime, context faithfulness
Data Privacy	10%	Training opt-out, enterprise DPA, GDPR/HIPAA/SOC 2 compliance, data residency
Integration Quality	5%	API quality, SDK availability, MCP support, rate limits, enterprise SLA
Adoption Ease	5%	No-code access, fine-tuning availability, documentation quality, community size

The overall weighted score reflects all-round business value. It is not a verdict on which model is "best" — the best model for a specific task is often different, which is exactly why the match engine and per-task pages exist.
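As an illustration of the weighting, here is a minimal sketch of the calculation. The weights are the v1.0 values from the table. The 0-100 factor scale, the key names, the function name `overall_score` and the example scores are assumptions made for this sketch, not published values.

```python
# Weights from the v1.0 table, as integer percentages.
WEIGHTS_V1 = {
    "task_performance": 25,
    "cost_efficiency": 20,
    "context_window": 15,
    "speed_latency": 10,
    "safety_reliability": 10,
    "data_privacy": 10,
    "integration_quality": 5,
    "adoption_ease": 5,
}


def overall_score(factor_scores, weights=WEIGHTS_V1):
    """Weighted average of per-factor scores (assumed 0-100 scale)."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = set(weights) - set(factor_scores)
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    return sum(weights[name] * factor_scores[name] for name in weights) / 100


# Hypothetical factor scores for illustration only; not real model data.
example = {
    "task_performance": 80,
    "cost_efficiency": 60,
    "context_window": 90,
    "speed_latency": 70,
    "safety_reliability": 75,
    "data_privacy": 85,
    "integration_quality": 65,
    "adoption_ease": 50,
}
print(overall_score(example))  # 74.25
```

Because Task Performance and Cost Efficiency together carry 45% of the weight, a model that is strong on those two factors can outscore one that leads on every lower-weighted factor.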

What our scores are — and are not

They are an editorial synthesis of published evidence: provider pricing pages, official model documentation, public benchmark leaderboards (SWE-bench, ARC-AGI-2, Scale SEAL, Artificial Analysis) and independent test reports.

They are not first-person lab tests. We never claim to have personally benchmarked a model's speed or run our own evaluations. Where a vendor-reported number differs materially from an independent one, we show both and flag the gap.
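One way to make "differs materially" concrete is a relative-gap check. The sketch below is illustrative only: the 10% threshold and the function name `material_gap` are assumptions, since no numeric cutoff is stated here.

```python
def material_gap(vendor_value, independent_value, threshold=0.10):
    """Return (relative_gap, flagged) for a vendor vs. independent figure.

    threshold is an assumed cutoff (10% of the independent value);
    the methodology text does not specify a number.
    """
    if independent_value == 0:
        raise ValueError("independent_value must be non-zero")
    relative_gap = abs(vendor_value - independent_value) / abs(independent_value)
    return relative_gap, relative_gap > threshold


# Hypothetical throughput figures in tokens per second.
gap, flagged = material_gap(vendor_value=120, independent_value=100)
print(f"gap={gap:.0%}, flagged={flagged}")
```

Measuring the gap against the independent figure treats that number as the reference point. The opposite convention would work equally well, provided it is applied consistently.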

Data sources

Provider pricing and documentation pages (Anthropic, OpenAI, Google, DeepSeek, Meta, xAI, Microsoft, Perplexity)
SWE-bench Verified — swebench.com
Scale SEAL leaderboard — scale.com/leaderboard
ARC Prize / ARC-AGI — arcprize.org
Artificial Analysis — artificialanalysis.ai

Update cadence

AI pricing has dropped roughly 80% in the past year and benchmark leaderboards change monthly. Keeping the data current is the product: a stale comparison site loses authority fast in this market.

Monthly: token pricing refresh, data-verified date update, and scoring of any new flagship model within two weeks of its general availability.
Quarterly: full re-score of all models, schema validation, cross-link audit.
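The cadence above could be enforced with a simple date check. This is a sketch under assumptions: "monthly" is read as 31 days, and the names `pricing_is_stale` and `scoring_due_by` are hypothetical helpers, not part of any published tooling.

```python
from datetime import date, timedelta

MONTHLY_REFRESH = timedelta(days=31)     # assumed reading of "monthly"
NEW_MODEL_DEADLINE = timedelta(days=14)  # "within two weeks of general availability"


def pricing_is_stale(data_verified, today):
    """True if the data-verified date is older than the monthly window."""
    return today - data_verified > MONTHLY_REFRESH


def scoring_due_by(general_availability):
    """Latest date by which a new flagship model should be scored."""
    return general_availability + NEW_MODEL_DEADLINE


# Hypothetical dates for illustration.
print(pricing_is_stale(date(2026, 6, 1), today=date(2026, 7, 15)))  # True
print(scoring_due_by(date(2026, 6, 10)))  # 2026-06-24
```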

Scoring weight version log

Version	Date	Change
v1.0	June 2026	Initial weighting across 8 factors. Locked.

Weights will not change without incrementing the version here and recording the rationale on the ethics page.