Debates over AI benchmarks, and how AI labs report them, are spilling into public view.
This week, an OpenAI employee <a target="_blank" rel="nofollow" href="https://x.com/BorisMPower/status/1892407015038996740">accused Elon Musk's AI company, xAI, of publishing misleading benchmark results</a> for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, <a target="_blank" rel="nofollow" href="https://x.com/ibab/status/1892418351084732654">insisted that the company was in the right</a>.
The truth lies somewhere in between.
In a <a target="_blank" rel="nofollow" href="https://x.ai/blog/grok-3">post on xAI's blog</a>, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have <a target="_blank" rel="nofollow" href="https://x.com/DimitrisPapail/status/1888325914603516214">questioned AIME's validity as an AI benchmark</a>. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
The graph in xAI's post showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI's best available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph omitted o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, then takes the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear that one model surpasses another when in reality that isn't the case.
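Mechanically, consensus@64 is just majority voting over 64 sampled answers per problem. Here is a minimal sketch of the idea (the function names and sample data are hypothetical illustrations, not any lab's actual evaluation harness):

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Return the most frequently generated answer among the samples."""
    counts = Counter(sampled_answers)
    answer, _count = counts.most_common(1)[0]
    return answer

def cons_at_k_score(samples_per_problem, correct_answers):
    """Fraction of problems where the majority-vote answer is correct."""
    solved = sum(
        consensus_answer(samples) == truth
        for samples, truth in zip(samples_per_problem, correct_answers)
    )
    return solved / len(correct_answers)

# Hypothetical model output: 64 samples for one problem,
# with the correct answer "42" appearing 40 times out of 64.
samples = ["42"] * 40 + ["41"] * 24
print(consensus_answer(samples))  # "42"
```

The point of contention follows directly from this setup: a model whose single first attempt (pass@1) is often wrong can still score well under cons@64, because the vote smooths out individual failures.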
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (that is, the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails just barely behind OpenAI's o1 model set to "medium" computing. Yet xAI is <a target="_blank" rel="nofollow" href="https://x.com/xai/status/1892400129719611567">advertising Grok 3</a> as the "world's smartest AI."
Babushkin <a target="_blank" rel="nofollow" href="https://x.com/ibab/status/1892418351084732654">argued on X</a> that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
<blockquote class="wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow">
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, while in reality it's DeepSeek propaganda
(I actually think Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@1 deserves more scrutiny.) https://t.co/djqlJPCJH8 pic.twitter.com/3Wh8foffic
– Teortaxes (DeepSeek Twitter die-hard fan 2023 – ∞) (@teortaxesTex) <a target="_blank" rel="nofollow" href="https://twitter.com/teortaxesTex/status/1892535507352961221?ref_src=twsrc%5Etfw">February 20, 2025</a>
But as AI researcher Nathan Lambert <a target="_blank" rel="nofollow" href="https://x.com/natolambert/status/1892675458166382687">pointed out in a post</a>, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and about their strengths.
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>