Understanding AI · Updated June 2026

The ARC-AGI benchmark, explained

One test exposes the gap between looking smart and being smart. On ARC-AGI-3, GPT-5.4 scored 0.26%, at a cost of $5,000–$9,000 per task. Untrained humans scored 100%. That contrast may be the most telling result in AI right now.

In one line: ARC-AGI measures fluid, novel reasoning — the ability to figure out something you've never seen. Models that ace bar exams and write production code score near zero, because the test can't be solved by recombining training data. It's the clearest evidence of what today's AI genuinely cannot do.

What it tests

ARC-AGI-3 drops an agent into interactive environments — little games — with no instructions, no stated goals and no explicit rules. The agent has to discover everything through trial and observation, exactly the way a person does when handed a game they've never played. It rewards genuine learning-from-scratch, not stored knowledge.
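To make "discover everything through trial and observation" concrete, here is a minimal toy sketch in Python. It is not the ARC-AGI-3 interface: the class `HiddenRuleGame`, the function `explore`, the action labels and the one-dimensional board are all hypothetical, invented for illustration. The agent is told neither what the actions do nor where the goal is. It has to probe each action, note what changed, and then search.

```python
import random


class HiddenRuleGame:
    """Toy stand-in for an interactive environment (hypothetical, not ARC-AGI-3).

    The agent only ever sees a position on a line and a 'done' flag.
    It is never told what the actions mean or where the goal is.
    """

    def __init__(self, size=7, start=3, goal=6, seed=0):
        effects = [-1, +1, 0]
        # Shuffle so that the action labels carry no meaning.
        random.Random(seed).shuffle(effects)
        self._effects = dict(zip("ABC", effects))
        self._size, self._start, self._goal = size, start, goal
        self._position = start

    def reset(self):
        """Start a new episode and return the first observation."""
        self._position = self._start
        return self._position

    def step(self, action):
        """Apply an action; return (observation, done)."""
        moved = self._position + self._effects[action]
        self._position = min(max(moved, 0), self._size - 1)  # walls at both ends
        return self._position, self._position == self._goal


def explore(env, actions="ABC", max_steps=50):
    """Learn what each action does by trying it, then sweep until solved.

    Returns (learned_effects, steps_used, solved).
    """
    learned, steps = {}, 0
    pos = env.reset()

    # Phase 1: try every action once and record how the observation changed.
    for action in actions:
        new_pos, done = env.step(action)
        steps += 1
        learned[action] = new_pos - pos
        pos = new_pos
        if done:
            return learned, steps, True

    # Phase 2: the goal is unknown, so sweep in each direction that moves us.
    for action in (a for a, delta in learned.items() if delta != 0):
        while steps < max_steps:
            new_pos, done = env.step(action)
            steps += 1
            if done:
                return learned, steps, True
            if new_pos == pos:  # hit a wall, try the other direction
                break
            pos = new_pos

    return learned, steps, False


env = HiddenRuleGame()
learned, steps, solved = explore(env)
print(learned, steps, solved)
```

The strategy here is hand-written for this one toy game, which is exactly the point of the contrast: the benchmark asks the agent to come up with a strategy like this on its own, in environments it has never seen.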

The March 2026 result

Participant         ARC-AGI-3 score
Untrained humans    100%
GPT-5.4             0.26%
Gemini 3.1 Pro      <1%
Claude Opus 4.6     <1%
Grok-4.20           <1%

A 10-year-old masters a new mobile game in minutes. GPT-5.4 scored 0.26% — and burned thousands of dollars of compute doing it.

Why this happens

Everything an LLM does well is grounded in patterns it absorbed during training (see "what an LLM is"). ARC-AGI deliberately removes that crutch: there's nothing to recall, only something to work out. Novel reasoning and learning from interaction are exactly what next-token prediction doesn't provide, which is the heart of the argument in "can AI become intelligent".

Why the contrast matters

The same models score 70%+ on SWE-bench and pass professional exams. That isn't a contradiction: those tasks sit inside the training distribution, and ARC-AGI doesn't. The lesson for business: AI is superb on work that resembles what it has seen, and unreliable the moment a task is genuinely new. That line is your trust boundary.

What it means for your business

See the bigger debate: "AGI explained honestly" puts this result in context, and "if it quacks, it's a duck" asks whether it even matters for business.