
The test games are easy for humans but exceedingly difficult for machines, and the benchmark starts out with a 100% score for humans and 0% for the AIs tested.
Continue reading “New benchmark from ARC-AGI uses puzzles to test for general intelligence”

FrontierMath is a benchmark of 300 questions spanning from upper-graduate level to Fields Medalist-level challenges, and the best machines score only about 2% on it.
Continue reading “Pitting humans against AI at FrontierMath yields mixed results”

Surprisingly good ranking
The biggest selling point for their latest Maverick model was how well it did on precisely this benchmark, scoring just above ChatGPT 4o-latest and slightly below Gemini 2.5 Pro, considered the current cutting edge of AI engineering.
The fact that Llama 4 Maverick took second place, right between these two, on the most watched leaderboard in AI raised a lot of eyebrows over the weekend.
Continue reading “Meta gamed the LMArena AI benchmarks with new model”

A new cognitive problem-solving benchmark, ARC-AGI-2, which is easy for humans but really tough for even the best reasoning AIs, shows just how far away artificial general intelligence still is, TechCrunch reports.
In fact, most state-of-the-art AIs score in the low single digits on this test, whereas humans average 60%.
Continue reading “New benchmark shows how far we are from AGI”