As noted in the disclaimer above, to properly understand how LLMs perform on coding tasks, it is advisable to evaluate them from multiple perspectives.
Benchmarking via HumanEval
Initially, I tried to aggregate results from various benchmarks to see which model comes out on top. However, this approach had a central problem: different models are reported against different benchmarks and configurations, so the numbers are hard to compare directly. Only one benchmark seemed to be the de facto default for evaluating coding performance: HumanEval. This is a benchmark dataset of human-written coding problems that evaluates a model's ability to generate correct, functional code from a given specification. By assessing code completion and problem-solving skills, HumanEval serves as a standard measure of coding proficiency in LLMs.
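To make the setup concrete, here is a minimal sketch of what a HumanEval-style evaluation looks like: the model is given a function signature and docstring, its completion is executed, and hidden unit tests decide whether the solution is functionally correct. The task, completion, and tests below are made up for illustration and are not actual entries from the dataset.

```python
# Illustrative HumanEval-style task (not a real dataset entry):
# the model receives a function signature plus docstring and must complete the body.
PROMPT = '''
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i + 1]."""
'''

# A candidate completion as it might be produced by the model under evaluation.
COMPLETION = '''
    result, best = [], float("-inf")
    for n in numbers:
        best = max(best, n)
        result.append(best)
    return result
'''

def check(candidate):
    # Unit tests of this kind decide whether the generated code counts as correct.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []

# Execute prompt + completion, then run the tests; HumanEval aggregates such
# pass/fail outcomes across tasks (reported as pass@k).
namespace: dict = {}
exec(PROMPT + COMPLETION, namespace)
check(namespace["running_max"])
print("task passed")
```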
The voice of the people through Elo scores
While benchmarks give a good view of a model's performance, they should be taken with caution. Given the large amount of data LLMs are trained on, some benchmark content (or very similar content) may have been part of that training data. That is why it is useful to also evaluate models based on how humans judge their output. Elo ratings, such as those from Chatbot Arena (coding only), do precisely that. These scores are derived from head-to-head comparisons of LLMs on coding tasks, judged by humans. Models are pitted against each other, and their Elo scores are adjusted based on wins and losses in these pairwise matches. An Elo score shows a model's performance relative to the others in the pool, with higher scores indicating better performance. For example, a difference of 100 Elo points suggests that the higher-rated model is expected to beat the lower-rated model approximately 64% of the time.
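That win-rate figure follows from the standard Elo expected-score formula. The short snippet below is my own illustration of where the ~64% comes from, not code from Chatbot Arena:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 100-point gap translates into roughly a 64% expected win rate.
print(round(elo_expected_score(1300, 1200), 3))  # -> 0.64
```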
Current state of model performance
Now, let's examine how these models perform when we compare their HumanEval scores to their Elo ratings. The image below illustrates the current coding landscape for LLMs, with models grouped by the companies that created them. Each company's best-performing model is highlighted.
OpenAI models sit at the top on both metrics, demonstrating their superior ability to solve coding tasks. The best OpenAI model outperforms the best non-OpenAI model, Anthropic's Claude 3.5 Sonnet, by 46 Elo points, which corresponds to an expected win rate of 56.6% in head-to-head coding tasks, and by 3.9% on HumanEval. While this gap is not overwhelming, it shows that OpenAI still holds the edge. Interestingly, the best model is o1-mini, which scores higher than the larger o1 by 10 Elo points and 2.5% on HumanEval.
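As a quick sanity check on the quoted gap (my own arithmetic, using the same Elo formula as above), a 46-point difference does indeed correspond to roughly a 56.6% expected win rate:

```python
# Expected win rate for a 46-point Elo advantage.
print(round(1 / (1 + 10 ** (-46 / 400)), 3))  # -> 0.566
```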
Conclusion: OpenAI continues to dominate, ranking at the top in benchmark performance and real-world usage. Surprisingly, o1-mini is the best performing model, outperforming its larger counterpart, o1.
Other companies follow closely and appear to exist within the same “performance field.” To provide a clearer idea of the difference in model performance, the figure below shows the winning probabilities of each company's best model, as indicated by its Elo rating.