New benchmark from ARC-AGI uses puzzles to test for general intelligence

The new benchmark uses block puzzle games to test for AGI, or Artificial General Intelligence.
ARC-AGI-3 uses never-before-seen games to see how good AI is at solving puzzles on the fly. (Picture: Screenshot)
While the organization's other benchmarks test for knowledge and competence against humans, ARC-AGI-3 uses simple games to test for human-like AGI, or Artificial General Intelligence – which some are calling «superintelligence.»

The test games are easy for humans but exceedingly difficult for machines; the benchmark starts out with a 100% score for humans and 0% for the AIs tested.

Continue reading “New benchmark from ARC-AGI uses puzzles to test for general intelligence”

Pitting humans against AI at FrontierMath yields mixed results

FrontierMath is notoriously difficult for machines to solve, but they are evolving quickly. (Picture: Epoch AI)
Epoch AI, the team behind the ridiculously difficult FrontierMath benchmark, decided to check how well humans do on it – and now predicts superhuman AI performance within a year's time.

FrontierMath is a synthetic benchmark that contains 300 questions spanning from upper-graduate level to Fields Medalist-level challenges, and the best machines score about 2% on it.

Continue reading “Pitting humans against AI at FrontierMath yields mixed results”

Meta gamed the LMArena AI benchmarks with new model

People are astounded Meta used a non-public, unreleased and optimized model on the industry's most respected benchmark. (Picture: Meta)
According to several reports, Meta appears to have used an unpublished Llama 4 Maverick model created specifically to score well on the LMArena benchmark.

Surprisingly good ranking
The biggest selling point for their latest Maverick model was how well it did on precisely this benchmark, scoring just above ChatGPT 4o-latest and slightly below Gemini 2.5 Pro, considered the current cutting edge of AI engineering.

The fact that Llama 4 Maverick took second place, between these two, on the most-watched leaderboard in AI raised a lot of eyebrows over the weekend.
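For context on how such a ranking comes about: LMArena places models on its leaderboard based on pairwise human votes, using an Elo-style rating. The sketch below is purely illustrative – the model names, starting ratings and K-factor are assumptions for the example, not LMArena's actual implementation.

```python
# Illustrative Elo-style pairwise rating update, the kind of scheme
# vote-based leaderboards build on. Not LMArena's actual code; the
# model names and K-factor below are invented for the example.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings for both models after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two hypothetical models start at 1500; "model_x" wins one vote.
ratings = {"model_x": 1500.0, "model_y": 1500.0}
ratings["model_x"], ratings["model_y"] = update(
    ratings["model_x"], ratings["model_y"], a_won=True
)
print(ratings)  # model_x moves up ~16 points, model_y drops ~16
```

Repeated over many thousands of votes, these small updates converge to the rankings shown on the leaderboard, which is why a single well-tuned model can shift its position noticeably.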

Continue reading “Meta gamed the LMArena AI benchmarks with new model”

New benchmark shows how far we are from AGI

The new ARC-AGI-2 tests for «fluid intelligence» and is bad news for the state-of-the-art models. (Picture: The Arc Prize Foundation)
Artificial General Intelligence denotes AI that is more proficient than humans at most tasks, and it is the holy grail of current AI research, even though the exact definition differs depending on who you ask.

A new cognitive, problem-solving benchmark, ARC-AGI-2, which is easy for humans and really tough for even the best reasoning AIs, shows just how far away that goal is, TechCrunch reports.

In fact, most state-of-the-art AIs score in the low single digits on this test, whereas humans score an average of 60%.
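To give a sense of what these puzzles look like: tasks in the original ARC format are small coloured grids given as JSON, with a few input/output training pairs and a test input whose output the solver must reproduce exactly. The snippet below is a toy, hand-made example in that spirit – the grids and the «mirror» rule are invented for illustration and are not an actual ARC-AGI-2 task.

```python
# Toy example in the style of the public ARC task format (JSON grids of
# small integers). The grids and the rule below are invented for
# illustration, not taken from ARC-AGI-2.

toy_task = {
    "train": [
        # Each pair shows an input grid and the expected output grid;
        # here the (hypothetical) rule is "mirror the grid left-to-right".
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]], "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 5], [0, 6]]},  # the solver must produce the output grid
    ],
}

def mirror(grid):
    """Candidate rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A solver is judged on whether its predicted output matches exactly.
for pair in toy_task["train"]:
    assert mirror(pair["input"]) == pair["output"]

print(mirror(toy_task["test"][0]["input"]))  # [[5, 0], [6, 0]]
```

Humans infer rules like this from a couple of examples almost instantly; the benchmark's point is that current reasoning models still struggle to do the same.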

Continue reading “New benchmark shows how far we are from AGI”