
The test games are easy for humans but exceedingly difficult for machines; the benchmark starts out with a 100% score for humans and 0% for the AIs tested.
The games themselves are just simple blocks on the screen, with no description of the goals, how to win, or even how the controls work; both humans and AIs have to figure this out for themselves.
Humans can solve them intuitively and cognitively, but machines get stuck in a loop, so far.
New and random games to solve
The benchmark is described as a «set of hand-crafted novel environments that are designed to test the skill-acquisition efficiency of artificial systems.»
That basically means the games are new each time and cannot be, well, gamed or learned about beforehand, and that the AIs will need to learn them on the fly, acquiring new skills in the process.
To win the games, AI models will have to demonstrate intelligence over time: think through extended trajectories, use memory and self-reflection, and execute plans within the game.
A break from tradition
— Traditionally, static benchmarks have been the yardstick for measuring intelligence, but they do not have the bandwidth required to measure the full spectrum of intelligence, they say on the website.
The benchmark provides a «rich medium to test experience-driven competence,» they say, and until it is solved, «we do not have AGI.»
First game already solved by ChatGPT Agent?
X.com user Zhiqing Sun claims to have already used ChatGPT Agent to solve the first puzzle, whereas ChatGPT o3 and Grok 4 struggle greatly and cannot complete a single level:
just tried and the agent solved level 1 in its own browser lol. thanks for creating the benchmark! https://t.co/4DnRc9l3T7 pic.twitter.com/n4jGhsrE2n
— Zhiqing Sun (@EdwardSun0909) July 18, 2025
Read more: Launch page for ARC-AGI-3, where you can also play the games yourself. Discussion on r/singularity.