Meta gamed the LMArena AI benchmark with new model

People are astounded that Meta used a non-public, unreleased, optimized model on the industry's most respected benchmark. (Picture: Meta)
According to several reports, Meta appears to have used an unpublished Llama 4 Maverick model created specifically to score well on the LMArena benchmark.

Surprisingly good ranking
The biggest selling point for Meta's latest Maverick model was how well it did on precisely this benchmark, scoring just above ChatGPT 4o-latest and slightly below Gemini 2.5 Pro, considered the current cutting edge of AI engineering.

The fact that Llama 4 Maverick took second place between these two, on the most watched leaderboard in AI, raised a lot of eyebrows over the weekend.

LMArena screenshot showing Maverick in second place.

At the same time, real-world users were complaining about the quality of the official model's chat abilities, surfacing a little-known benchmark for emotional intelligence, Longform Creative Writing, where it scored last among current models.

«Chat optimized» version
Now it turns out Meta wasn't exactly forthcoming about the actual model used in the LMArena evaluations, and has admitted to The Verge that it was in fact using an unpublished, «chat optimized» version of the Llama 4 Maverick bot.

— ‘Llama-4-Maverick-03-26-Experimental’ is a chat optimized version we experimented with that also performs well on LMArena, says Meta's Ashley Gabriel to the website.

The kicker, of course, is that this experimental model never went live in any of the web instances, and is not the same model Meta released into the wild on Saturday.

Benchmark changing its policies
LMArena says it is looking into how companies might game the benchmark, and that it is conducting «a deeper analysis» and «updating policies»:

— Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future, they tweeted.

In response to the allegations of gaming the system, Meta's head of generative AI initially said that «we would never do that»:

— We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations, tweeted Ahmad Al-Dahle, VP of generative AI, according to TechCrunch.

Expect many flavors of Llama 4
Meta's Ashley Gabriel also noted to The Verge in a later story that the Llama models are open source and open weight, so there will obviously be many flavors of Llama 4 in the future as users start tinkering with it:

— We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback, she says.

This kind of gaming of the benchmark is technically not forbidden, but it is seriously misleading to users who view LMArena as the gold standard in an already flawed field of benchmarks. Most would probably agree that the testing should be done on the official release of the model.

As of this writing, Llama 4 Maverick Experimental is still ranked second on LMArena, so it has not been taken down just yet.

UPDATE: Since this story was published, Meta has submitted a production model of Llama 4 Maverick, and it ranks in 38th place:

You have to scroll down to see it, but the Llama 4 Maverick production model ranks 38th on the Chatbot Arena LLM Leaderboard. (Picture: Screenshot)

See also: Meta announces Llama 4

Read more: TechCrunch news article, TechCrunch: Meta's denial, later writeup by The Verge, and r/singularity.