Methodology

How we score AI models

Every model is scored across eight weighted factors. Scores are editorial — based on published benchmarks, provider documentation and independent test reports. We do not run first-person lab tests, and we take no payment for placement.

In one line: a model's overall score is a weighted average of task performance, cost efficiency, context window, speed, safety, privacy, integration and adoption ease. Weights are locked at v1.0 and version-controlled. Every individual score links to its source.

The 8 scoring factors

Factor | Weight | What it measures
Task Performance | 25% | Benchmark scores by task: coding (SWE-bench), reasoning (ARC-AGI-2), writing, analysis, multimodal
Cost Efficiency | 20% | Input/output token price, context caching discounts, agentic multiplier (5–20x), volume economics
Context Window | 15% | Token context length, critical for document processing, long workflows and large codebases
Speed / Latency | 10% | Tokens per second and time-to-first-token, critical for live interactions and real-time agents
Safety & Reliability | 10% | Hallucination rates by task, refusal consistency, uptime, context faithfulness
Data Privacy | 10% | Training opt-out, enterprise DPA, GDPR/HIPAA/SOC2 compliance, data residency
Integration Quality | 5% | API quality, SDK availability, MCP support, rate limits, enterprise SLA
Adoption Ease | 5% | No-code access, fine-tuning availability, documentation quality, community size

The overall weighted score reflects all-round business value. It is not a verdict on which model is "best" — the best model for a specific task is often different, which is exactly why the match engine and per-task pages exist.
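As an illustration of the weighted average, here is a minimal Python sketch. The weights come from the v1.0 factor table. Everything else is an assumption made for the example: the function name, the dictionary keys, the 0 to 10 scale and the sample scores are invented, and this is not the site's actual scoring code.

```python
# Weights from the v1.0 factor table, as percentages.
WEIGHTS_V1 = {
    "task_performance": 25,
    "cost_efficiency": 20,
    "context_window": 15,
    "speed_latency": 10,
    "safety_reliability": 10,
    "data_privacy": 10,
    "integration_quality": 5,
    "adoption_ease": 5,
}


def overall_score(factor_scores: dict[str, float]) -> float:
    """Weighted average of per-factor scores.

    Assumes every factor is scored on the same scale (hypothetical: 0 to 10).
    """
    missing = WEIGHTS_V1.keys() - factor_scores.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS_V1.values())  # 100 for v1.0
    weighted = sum(WEIGHTS_V1[name] * factor_scores[name] for name in WEIGHTS_V1)
    return weighted / total_weight


# Invented scores for an imaginary model, for illustration only.
example_scores = {
    "task_performance": 8,
    "cost_efficiency": 6,
    "context_window": 9,
    "speed_latency": 7,
    "safety_reliability": 8,
    "data_privacy": 7,
    "integration_quality": 9,
    "adoption_ease": 8,
}

print(overall_score(example_scores))  # 7.6
```

In this invented example, strong context-window and integration scores cannot fully offset a middling cost score, because cost efficiency carries four times the weight of integration quality.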

What our scores are — and are not

They are an editorial synthesis of published evidence: provider pricing pages, official model documentation, public benchmark leaderboards (SWE-bench, ARC-AGI-2, Scale SEAL, Artificial Analysis) and independent test reports.

They are not first-person lab tests. We never claim to have personally benchmarked a model's speed or run our own evaluations. Where a vendor-reported number differs materially from an independent one, we show both and flag the gap.
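One way to picture the "flag the gap" rule is a relative-difference check. The sketch below is illustrative only: this page does not define what counts as a material difference, so the 10% threshold, the function name and the sample figures are all assumptions.

```python
def is_material_gap(vendor: float, independent: float, threshold: float = 0.10) -> bool:
    """Return True when the vendor figure differs from the independent one
    by more than `threshold`, measured relative to the independent figure.

    The 10% default is a hypothetical cut-off, not a published rule.
    """
    if independent == 0:
        # Relative difference is undefined; treat any non-zero vendor figure as a gap.
        return vendor != 0
    return abs(vendor - independent) / abs(independent) > threshold


# Invented figures: vendor claims 120 tokens/s, an independent report measures 95.
print(is_material_gap(120, 95))   # True  (about a 26% difference)
print(is_material_gap(100, 98))   # False (about a 2% difference)
```

When the check returns True, both numbers would be shown side by side with the gap flagged, rather than one being silently preferred.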

Data sources

Update cadence

AI pricing has dropped roughly 80% in the past year, and benchmark leaderboards change monthly. Staying current is the product: a stale comparison site loses authority fast in this market.

Scoring weight version log

Version | Date | Change
v1.0 | June 2026 | Initial weighting across 8 factors. Locked.

Weights will not change without incrementing the version here and recording the rationale on the ethics page.