GPT-5 and Gemini 2.5 Pro score Gold at the International Olympiad on Astronomy and Astrophysics

GPT-5 and Gemini 2.5 Pro would make excellent research assistants, but are not yet suited for autonomous discoveries, the study finds.
One of the exam questions involves calculating the distance to a quasar. (Picture: screenshot)
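For flavor, a task like this typically starts from the quasar's measured redshift. A minimal sketch (not the actual exam problem; the redshift and Hubble-constant values are illustrative) using the low-redshift Hubble-law approximation d ≈ cz/H₀:

```python
# Rough distance estimate from redshift via Hubble's law, d ≈ c*z / H0.
# Valid only for small z; real quasar problems need the full relativistic
# and cosmological treatment, which the IOAA expects competitors to know.

C_KM_S = 299_792.458   # speed of light in km/s
H0 = 70.0              # Hubble constant in km/s/Mpc (illustrative value)

def hubble_distance_mpc(z: float) -> float:
    """Approximate distance in megaparsecs for a small redshift z."""
    return C_KM_S * z / H0

# Example: an object at redshift z = 0.05
d = hubble_distance_mpc(0.05)
print(f"{d:.0f} Mpc")  # ≈ 214 Mpc
```

Actual olympiad derivations go well beyond this linear approximation, which is part of what makes the models' theory scores notable.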
Scientists and judges from the International Olympiad on Astronomy and Astrophysics (IOAA) ran five top AI models through the exams from 2022 to 2025, and the models from OpenAI and Google earned the top scores.

The IOAA is a top-rated competition for high school students worldwide, held annually with some 300 participants from 64 countries. Its questions require deep conceptual understanding, multimodal analysis, and multi-step derivations.

The other models participating in the contest were OpenAI’s o3, Anthropic’s Claude Opus 4.1, and Claude Sonnet 4.

«Exceptional» in some parts
On the theory exams, GPT-5 and Gemini 2.5 Pro achieved scores between 84.2% and 85.6%, placing them among the top two human participants. o3 scored 77.5%, and the two Claude models scored between 60.6% and 64.7%.

GPT-5 demonstrated «exceptional multimodal capabilities» on the data analysis exams with an 88.5% score, while the Gemini model scored 75.7% and the remaining models came in below 70%.

The models failed most consistently on conceptual theory questions, sometimes showing flawed reasoning and poor geometric/spatial visualization.

Good enough to be genuine research assistants
According to the study, LLMs frequently surpass the best human competitors in theory and data analysis, with the Gemini and GPT-5 models placing first or second in many of the tests.

In other words, they don’t just reach Gold-level performance; they consistently outperform human participants in these areas.

The study concludes that LLMs are now capable of genuine scientific reasoning in astronomy and astrophysics, and are fully suitable as co-scientists for research projects.

Still some work to do
They do, however, have Achilles’ heels in geometric reasoning, multimodal integration, and «mathematical rigor,» meaning they cannot yet be trusted to work autonomously.

While they can serve as «superhuman» research assistants in complex problem-solving and can substantially accelerate research, the study identifies these weaknesses as areas needing development before the models can function not just as partners but as autonomous agents making discoveries.

As OpenAI researcher Mostafa Rohaninejad tweeted in September after winning the math and coding olympiad, «The next frontier is the discovery of new knowledge, which is the true milestone at the end of the day» — but according to this benchmark, there is still some way to go.

Read more: The actual study, and a nice summary. Discussion on r/Singularity.