
FrontierMath is a benchmark of 300 original math problems, spanning from upper-graduate level to challenges fit for a Fields Medalist, and when it launched the best AI models scored only about 2% on it.
Undergrads and PhDs from MIT
For the human-versus-AI test, Epoch AI went to MIT and selected 40 academics, ranging from exceptional undergraduates to PhD students, and split them into teams of four to five.
The eight teams were given internet access and asked to solve 23 of the "easier" questions from the benchmark, while the same problems were given to the current top performer on the benchmark, OpenAI's o4-mini-medium.

Better than teams, worse on aggregate
Once the scores were added up, the AI beat the average human team by a small margin, but lost by a large margin to the combined score of all the human teams: together, the teams solved 30-40% of the problems, while the AI managed a little over 20%.
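To make the difference between those two comparisons concrete, here is a minimal sketch with hypothetical per-team results (the report does not publish these exact numbers): each team solves only a handful of the 23 problems, but different teams solve different ones, so the set of problems solved by at least one team is far larger than any single team's tally.

```python
# Hypothetical illustration (not Epoch AI's actual data) of how an AI can beat
# the *average* team while losing to the teams' *combined* coverage.

problems = range(1, 24)  # the 23 contest problems

# Hypothetical sets of problems solved by each of the eight teams.
team_solutions = [
    {1, 2, 3, 4, 5}, {1, 2, 3, 4}, {2, 3, 5, 6}, {1, 4, 5, 7},
    {1, 2, 3, 6, 8}, {2, 4, 5}, {1, 3, 6, 7, 9}, {2, 5, 8, 9},
]
ai_solutions = {1, 2, 3, 4, 5}  # hypothetical: a little over 20% of 23

# Average team score: mean number of problems solved per team, as a fraction of 23.
avg_team_score = sum(len(s) for s in team_solutions) / len(team_solutions) / len(problems)
# Combined score: fraction of problems solved by at least one team (union of all sets).
combined_score = len(set().union(*team_solutions)) / len(problems)
ai_score = len(ai_solutions) / len(problems)

print(f"average team:       {avg_team_score:.0%}")  # ~18%, just below the AI
print(f"all teams combined: {combined_score:.0%}")  # ~39%, well above the AI
print(f"AI:                 {ai_score:.0%}")        # ~22%
```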
So, does this make humans better than AI at solving math? That is the question the researchers at Epoch AI ask.
The answer seems to be yes, but the questions in this test revolve more around reasoning than knowledge, precisely to offset AI's already huge advantage of knowing more than "even the most erudite human mathematicians."
o4-mini finished substantially faster
o4-mini-medium also needed a mere 5-20 minutes per problem, far faster than the humans, who spent around 40 minutes per task.
The report concludes that while the human teams and the current state-of-the-art AI score in roughly the same ballpark, the trajectory of the benchmark and of AI progress suggests that will change.
Indeed, the report says they "think it's likely that AIs will unambiguously exceed this threshold by the end of the year."
"I think this is a useful human baseline that helps put FrontierMath evaluations into context, and I'm interested to see when AIs cross this threshold," the report concludes.
Read more: The report from Epoch AI and a handy Twitter thread. More on FrontierMath, discussion on r/singularity, and OpenAI’s o3, o4 release from teknotum.