New benchmark shows how far we are from AGI

The new ARC-AGI-2 tests for «fluid intelligence» and is bad news for state-of-the-art models. (Picture: The Arc Prize Foundation)
Artificial General Intelligence (AGI) refers to AI that is more proficient than humans at most tasks. It is the holy grail of current AI research, although the exact definition varies depending on who you ask.

A new cognitive, problem-solving benchmark, ARC-AGI-2, which is easy for humans but very tough for even the best reasoning AIs, shows just how far away that goal is, TechCrunch reports.

In fact, most state-of-the-art AIs score in the low single digits on this test, whereas humans score an average of 60%.

Easy for humans, tricky for AIs
— ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts, says the Arc Prize Foundation, the organization behind the test, in its release post.

The test consists of a set of previously unseen puzzles built from shapes and colors, and it is apparently not possible to solve them with «brute force» – by throwing computing power at the problems.

— It’s an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with, tweets François Chollet, co-founder of the Arc Prize Foundation.

No more «PhD-level» tests
The test seems to have more in common with traditional IQ tests than with measuring «PhD-level» reasoning in any particular field, which was the focus of previous tests and approaches to AGI.

Before this, the thinking went that if an AI could complete tasks at PhD level in a variety of fields, we would have AGI.

This test instead measures the ability to solve problems on the fly, not skills that can be picked up by reading training material such as books and research papers.

The best score on the benchmark currently belongs to OpenAI's o3 model in its low-compute setting, which scored 4 per cent while using $200 worth of compute.

The previous benchmark from the Arc Prize Foundation, ARC-AGI-1, showed «the exact moment» when reasoning models moved beyond simply repeating learned content, with OpenAI's o3 scoring 75 per cent. That effectively saturated the old test and made a new one necessary.

You can try out the tests here and see if you can beat the AIs' scores.

Read more: TechCrunch, The Arc Prize Foundation’s blog post, r/singularity discussion.