
FrontierMath is a benchmark of 300 original math problems, spanning from upper-graduate level to challenges fit for a Fields Medalist, and when it launched the best AI models scored only about 2% on it.
Undergrads and PhDs from MIT
For the human-versus-AI test, Epoch AI went to MIT and selected 40 academics, ranging from exceptional undergraduates to PhD students, and split them into teams of four to five.
The eight teams were given internet access and asked to solve 23 of the "easier" questions from the benchmark, while the same problems were given to the current top performer on the benchmark, OpenAI's o4-mini-medium.

Better than teams, worse on aggregate
Once the scores were added up, the AI beat the average human team by a small margin, but lost by a large margin to the combined score of all the human teams: together, the teams solved 30-40% of the problems, while the AI managed a little over 20%.
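To make the difference between those two comparisons concrete, here is a minimal sketch with hypothetical per-team results (the report does not publish these exact numbers): each team solves only a handful of the 23 problems, but different teams solve different ones, so the set of problems solved by at least one team is far larger than any single team's tally.

```python
# Hypothetical illustration (not Epoch AI's actual data) of how an AI can beat
# the *average* team while losing to the teams' *combined* coverage.

problems = range(1, 24)  # the 23 contest problems

# Hypothetical sets of problems solved by each of the eight teams.
team_solutions = [
    {1, 2, 3, 4, 5}, {1, 2, 3, 4}, {2, 3, 5, 6}, {1, 4, 5, 7},
    {1, 2, 3, 6, 8}, {2, 4, 5}, {1, 3, 6, 7, 9}, {2, 5, 8, 9},
]
ai_solutions = {1, 2, 3, 4, 5}  # hypothetical: a little over 20% of 23

# Average team score: mean number of problems solved per team, as a fraction of 23.
avg_team_score = sum(len(s) for s in team_solutions) / len(team_solutions) / len(problems)
# Combined score: fraction of problems solved by at least one team (union of all sets).
combined_score = len(set().union(*team_solutions)) / len(problems)
ai_score = len(ai_solutions) / len(problems)

print(f"average team:       {avg_team_score:.0%}")  # ~18%, just below the AI
print(f"all teams combined: {combined_score:.0%}")  # ~39%, well above the AI
print(f"AI:                 {ai_score:.0%}")        # ~22%
```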
So, does this make humans better than AI at solving math? That is the question the researchers at Epoch AI ask.
The answer seems to be yes, but the questions in this test revolve more around reasoning than knowledge, precisely to offset AI's already huge advantage of knowing more than "even the most erudite human mathematicians."
o4-mini finished substantially faster
o4-mini-medium also needed a mere 5-20 minutes per problem, far faster than the humans, who spent around 40 minutes per task.
The report concludes that while the human teams and the current state-of-the-art AI score in roughly the same ballpark, the trajectory of the benchmark and of AI progress suggests that will change.
Indeed, the report says they "think it's likely that AIs will unambiguously exceed this threshold by the end of the year."
"I think this is a useful human baseline that helps put FrontierMath evaluations into context, and I'm interested to see when AIs cross this threshold," the report concludes.
Read more: The report from Epoch AI and a handy Twitter thread. More on FrontierMath, discussion on r/singularity, and OpenAI’s o3, o4 release from teknotum.