
The test games are easy for humans but exceedingly difficult for machines; the benchmark starts out with a 100% score for humans and 0% for the AIs tested.
The games themselves are just simple blocks on the screen, with no description of the goals, how to win, or even how the controls work; both humans and AIs have to figure this out for themselves.
Humans can solve them intuitively and cognitively, but machines get stuck in a loop, so far.
New and random games to solve
The benchmark is described as a «set of hand-crafted novel environments that are designed to test the skill-acquisition efficiency of artificial systems.»
That basically means the games are new each time and cannot be, well, gamed or learned about beforehand, and that the AIs will need to learn them on the fly, acquiring new skills in the process.
To win the games, AI models will have to demonstrate intelligence over time: think through extended trajectories, use memory and self-reflection, and execute plans within the game.
A break from tradition
— Traditionally, static benchmarks have been the yardstick for measuring intelligence, but they do not have the bandwidth required to measure the full spectrum of intelligence, they say on the website.
The benchmark provides a «rich medium to test experience-driven competence,» they say, and until it is solved, «we do not have AGI.»
First game already solved by ChatGPT Agent?
X.com user Zhiqing Sun claims to have already used ChatGPT Agent to solve the first puzzle, whereas ChatGPT o3 and Grok 4 struggle greatly and cannot complete a single level:
just tried and the agent solved level 1 in its own browser lol. thanks for creating the benchmark! https://t.co/4DnRc9l3T7 pic.twitter.com/n4jGhsrE2n
— Zhiqing Sun (@EdwardSun0909) July 18, 2025
Read more: Launch page for ARC-AGI-3, where you can also play the games yourself. Discussion on r/singularity.