In one line: ARC-AGI measures fluid, novel reasoning, meaning the ability to figure out something you've never seen. Models that ace bar exams and write production code score near zero on its latest version, ARC-AGI-3, because the test can't be solved by recombining training data. It's the clearest evidence of what today's AI genuinely cannot do.
What it tests
ARC-AGI-3 drops an agent into interactive environments — little games — with no instructions, no stated goals and no explicit rules. The agent has to discover everything through trial and observation, exactly the way a person does when handed a game they've never played. It rewards genuine learning-from-scratch, not stored knowledge.
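To make the setup concrete, here is a minimal Python sketch of that kind of loop. It is an illustration only, not the ARC-AGI-3 interface: the `HiddenRuleGame` class, its three opaque actions, its hidden goal and the `explore` function are all invented for this example. The agent is given no description of the actions or the goal and only sees what changes after each move.

```python
import random


class HiddenRuleGame:
    """Hypothetical toy environment, NOT the ARC-AGI-3 API.

    The agent sees only a position and a "won" flag. It is never told
    that the goal is to reach cell 4, or what each action does.
    """

    ACTIONS = ("a", "b", "c")  # opaque labels with undocumented effects

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Hidden rules: "a" moves right, "b" moves left, "c" does nothing.
        if action == "a":
            self.position = min(self.position + 1, 4)
        elif action == "b":
            self.position = max(self.position - 1, 0)
        won = self.position == 4
        return self.position, won


def explore(env, rng, max_steps=200):
    """Try actions at random and record what each one was seen to do."""
    effects = {}  # action label -> set of observed position changes
    obs = env.reset()
    for step in range(1, max_steps + 1):
        action = rng.choice(env.ACTIONS)
        new_obs, won = env.step(action)
        effects.setdefault(action, set()).add(new_obs - obs)
        obs = new_obs
        if won:
            return step, effects  # the goal was found by trial alone
    return None, effects  # step budget exhausted without winning


if __name__ == "__main__":
    steps, effects = explore(HiddenRuleGame(), random.Random(0))
    print("steps to win:", steps)
    print("observed effects:", effects)
```

Even this toy agent only tabulates what each button did, and it wins by luck rather than by forming a plan. Turning such observations into a goal and a strategy is the harder step that learning from scratch requires.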
The March 2026 result
| Participant | ARC-AGI-3 score |
|---|---|
| Untrained humans | 100% |
| GPT-5.4 | 0.26% |
| Gemini 3.1 Pro | <1% |
| Claude Opus 4.6 | <1% |
| Grok-4.20 | <1% |
A 10-year-old masters a new mobile game in minutes. GPT-5.4 scored 0.26% — and burned thousands of dollars of compute doing it.
Why this happens
Everything an LLM does well is grounded in patterns it absorbed during training (see "what an LLM is"). ARC-AGI deliberately removes that crutch: there's nothing to recall, only something to work out. Novel reasoning and learning from interaction are exactly what next-token prediction doesn't provide, which is the heart of the argument in "can AI become intelligent".
Why the contrast matters
The same models score 70%+ on SWE-bench and pass professional exams. That isn't a contradiction: those tasks sit inside the training distribution, and ARC-AGI doesn't. The lesson for business: AI is superb on work that resembles what it has seen, and unreliable the moment a task is genuinely new. That line is your trust boundary.
What it means for your business
- Trust AI inside its distribution — drafting, summarising, coding common patterns.
- Keep humans on the novel and high-stakes — anything genuinely new, ambiguous or consequential.
- Discount AGI hype in procurement — capability on novel reasoning is, for now, near zero.
See the bigger debate: "AGI explained honestly" puts this result in context, and "if it quacks, it's a duck" asks whether it even matters for business.