What is the ARC-AGI benchmark?

ARC-AGI is a test of fluid, novel reasoning. The latest version, ARC-AGI-3, drops an agent into interactive environments with no instructions, goals or rules, so it must work everything out through trial and observation — the way a person would with an unfamiliar game.

Why do AI models fail ARC-AGI?

Because it cannot be solved by recombining training patterns. It requires genuinely novel reasoning and learning from interaction, which current LLMs lack. In March 2026 every frontier model scored below 1% while untrained humans scored 100%.

What does ARC-AGI tell us about AI limits?

It shows that models which ace bar exams and write production code still cannot handle simple novel problems outside their training distribution. It is the clearest single measure of the gap between pattern completion and general reasoning.

ARC-AGI Benchmark Explained

In one line: ARC-AGI measures fluid, novel reasoning — the ability to figure out something you've never seen. Models that ace bar exams and write production code score near zero, because the test can't be solved by recombining training data. It's the clearest evidence of what today's AI genuinely cannot do.

What it tests

ARC-AGI-3 drops an agent into interactive environments — little games — with no instructions, no stated goals and no explicit rules. The agent has to discover everything through trial and observation, exactly the way a person does when handed a game they've never played. It rewards genuine learning-from-scratch, not stored knowledge.
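To make "trial and observation" concrete, here is a minimal sketch in Python. It is not the ARC-AGI-3 interface or one of its games: `HiddenRuleGame`, `explore` and `next_action` are hypothetical names invented for this illustration, and the environment is a deliberately tiny one-dimensional world. The agent is told nothing about what its actions do or where the goal is. It records what it observes after each action and keeps heading for the nearest situation it has not yet tried.

```python
import random
from collections import deque


class HiddenRuleGame:
    """Toy stand-in for an instruction-free game (hypothetical, not ARC-AGI-3)."""

    def __init__(self, size=7, seed=0):
        rng = random.Random(seed)
        self.size = size
        # Hidden from the agent: which opaque action id moves which way.
        moves = [-1, 1, 0]
        rng.shuffle(moves)
        self._effects = dict(enumerate(moves))
        # Hidden from the agent: where the goal cell is.
        self._start = size // 2
        self._goal = rng.choice([p for p in range(size) if p != self._start])
        self.position = self._start

    @property
    def actions(self):
        return list(self._effects)

    def reset(self):
        self.position = self._start
        return self.position

    def step(self, action):
        """Apply an action and return (observation, level_complete)."""
        moved = self.position + self._effects[action]
        self.position = max(0, min(self.size - 1, moved))  # walls at both ends
        return self.position, self.position == self._goal


def next_action(state, actions, transitions):
    """Breadth-first search over learned transitions for the nearest untried
    (state, action) pair; returns the first action on the path to it."""
    queue = deque([(state, None)])
    seen = {state}
    while queue:
        current, first = queue.popleft()
        for action in actions:
            if (current, action) not in transitions:
                return action if first is None else first
            following = transitions[(current, action)]
            if following not in seen:
                seen.add(following)
                queue.append((following, action if first is None else first))
    return None  # everything reachable has been tried


def explore(env, max_steps=500):
    """Act, observe, record. Returns (steps_taken, learned_transitions) or None."""
    transitions = {}  # (state, action) -> observed next state
    state = env.reset()
    for step in range(1, max_steps + 1):
        action = next_action(state, env.actions, transitions)
        if action is None:
            return None
        observed, done = env.step(action)
        transitions[(state, action)] = observed
        if done:
            return step, transitions
        state = observed
    return None


if __name__ == "__main__":
    steps, learned = explore(HiddenRuleGame(seed=0))
    print(f"solved in {steps} steps after learning {len(learned)} transitions")
```

Exhaustive search works here only because the toy world has seven cells and three actions. The sketch shows the shape of the loop (act, observe, update, act again), not a way to solve the benchmark.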

The March 2026 result

Participant	ARC-AGI-3 score
Untrained humans	100%
GPT-5.4	0.26%
Gemini 3.1 Pro	<1%
Claude Opus 4.6	<1%
Grok-4.20	<1%

A 10-year-old masters a new mobile game in minutes. GPT-5.4 scored 0.26% — and burned thousands of dollars of compute doing it.

Why this happens

Everything an LLM does well is grounded in patterns it absorbed during training (see "What an LLM is"). ARC-AGI deliberately removes that crutch: there's nothing to recall, only something to work out. Novel reasoning and learning from interaction are exactly what next-token prediction doesn't provide, which is the heart of the argument in "Can AI become intelligent".

Why the contrast matters

The same models score 70%+ on SWE-bench and pass professional exams. That isn't a contradiction: those tasks sit inside the training distribution and ARC-AGI does not. The lesson for business: AI is superb on work that resembles what it has seen, and unreliable the moment a task is genuinely new. That line is your trust boundary.

What it means for your business

Trust AI inside its distribution — drafting, summarising, coding common patterns.
Keep humans on the novel and high-stakes — anything genuinely new, ambiguous or consequential.
Discount AGI hype in procurement — capability on novel reasoning is, for now, near zero.

See the bigger debate: "AGI explained honestly" puts this result in context, and "If it quacks, it's a duck" asks whether it even matters for business.
