Teknotum

New benchmark shows how far we are from AGI

The new ARC-AGI-2 tests for «fluid intelligence», and it is bad news for state-of-the-art models. (Picture: The Arc Prize Foundation)
Artificial General Intelligence denotes the point at which AI becomes more proficient than humans at most tasks. It is the holy grail of current AI research, even though exact definitions vary depending on who you ask.

A new cognitive problem-solving benchmark, the ARC-AGI-2, which is easy for humans but tough for even the best reasoning AIs, shows just how far away that goal is, TechCrunch reports.

In fact, most state-of-the-art AIs score in the low single digits on this test, whereas humans score 60% on average.

Easy for humans, tricky for AIs
— ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts, says The Arc Prize Foundation, the organization behind the test, in its release post.

The test consists of completing a set of previously unseen puzzles made up of shapes and colors, and it is apparently not possible to solve them using «brute force» – by throwing computing power at the problems.

— It’s an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with, tweets François Chollet, co-founder of The Arc Prize Foundation.

No more «PhD-level» tests
The test seems to have more in common with traditional IQ tests than with measuring «PhD-level» reasoning in any particular field — which was the focus of previous tests and approaches to AGI.

Before this, the thinking went that if an AI could complete tasks at PhD level in a variety of different fields, we would have AGI.

This test instead measures intelligence on the fly, not things you can learn by reading training material such as books and research papers.

The best score on the benchmark currently belongs to OpenAI’s o3 (low) model, which achieved 4 per cent while using $200 worth of compute.

The previous benchmark from The Arc Prize Foundation, ARC-AGI-1, showed «the exact moment» when reasoning models reached beyond simply repeating learned content: OpenAI’s o3 scored 75%, effectively saturating the test and making a successor necessary.

You can try out the tests here, and see if you can get better scores than the AIs.

Read more: TechCrunch, The Arc Prize Foundation’s blog post, r/singularity discussion.

Author: Tor Fosheim. Posted on 25 March 2025, updated 26 March 2025. Tags: AI, benchmarks, openai

