Hugging Face has announced the launch of the Open LLM Leaderboard v2, a significant update designed to address the challenges and limitations of its predecessor. The new leaderboard introduces more rigorous benchmarks, refined evaluation methods, and a fairer scoring system, promising to reinvigorate the competitive landscape of language models.
Addressing benchmark saturation
Over the past year, the original Open LLM Leaderboard has become a vital resource in the machine learning community, attracting more than 2 million unique visitors and 300,000 monthly active users. Despite its success, rising model performance led to saturation of the benchmarks: models began to reach baseline human performance on benchmarks such as HellaSwag, MMLU, and ARC, reducing their usefulness for distinguishing model capabilities. Additionally, some models showed signs of contamination, having been trained on data similar to the benchmarks, which compromised the integrity of their scores.
Introduction of new benchmarks
To counter these issues, Open LLM Leaderboard v2 introduces six new benchmarks that cover a variety of model capabilities (a minimal evaluation sketch follows the list):
- MMLU-Pro: An improved version of the MMLU dataset, featuring ten answer choices per question instead of four, requiring more reasoning, and reviewed by experts to reduce noise.
- GPQA (Graduate-Level Google-Proof Q&A Benchmark): A highly challenging knowledge dataset written by domain experts to ensure difficulty and factuality, with gating mechanisms to prevent contamination.
- MuSR (Multi-Step Soft Reasoning): A dataset of algorithmically generated complex problems, including murder mysteries and team assignment optimizations, to test long-range reasoning and context analysis.
- MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset): High-school-level competition math problems formatted consistently for rigorous assessment, keeping only the most difficult (Level 5) questions.
- IFEval (Instruction-Following Evaluation): Tests the ability of models to follow explicit instructions, using rigorous metrics for evaluation.
- BBH (BIG-Bench Hard): A subset of 23 challenging tasks from the BIG-Bench dataset covering multi-step arithmetic, algorithmic reasoning, and language understanding.
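These benchmarks are run with EleutherAI's lm-evaluation-harness (discussed in the reproducibility section below). As a rough illustration of what an evaluation run looks like, the sketch below uses the harness's Python API; the task names, example model identifier, and arguments are assumptions that may differ between harness versions.

```python
# A minimal sketch, assuming a recent lm-evaluation-harness release that ships the
# leaderboard task group; task names and arguments may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=bfloat16",  # example model
    tasks=["leaderboard_ifeval", "leaderboard_mmlu_pro"],  # a subset of the v2 suite
    batch_size=8,
    apply_chat_template=True,  # v2 evaluates chat models with their chat template
)

# Print the per-task metrics reported by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```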
Fairer rankings with normalized scoring
A notable change in the new leaderboard is the adoption of normalized scores for ranking models. Previously, raw scores were simply summed, which could skew rankings because benchmarks vary in difficulty and in their random baselines. Now, scores are normalized between the random baseline (0 points) and the maximum possible score (100 points). This approach ensures a fairer comparison across benchmarks, preventing a single benchmark from disproportionately influencing the final ranking.
For example, in a benchmark with two options per question, a random baseline would score 50 points. This raw score would be normalized to 0, aligning scores across benchmarks and providing a clearer picture of the model's performance.
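In code, the rescaling is simple. The sketch below illustrates the normalization described above; it is an interpretation of the published description, not the leaderboard's exact implementation.

```python
# Illustrative normalization: rescale a raw score (in percent) so that the random-guess
# baseline maps to 0 and a perfect score maps to 100.
def normalize_score(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
    normalized = (raw_score - random_baseline) / (max_score - random_baseline) * 100.0
    return max(normalized, 0.0)  # scores below the random baseline are floored at 0

# Two-choice benchmark: random guessing scores 50 points.
print(normalize_score(50.0, 50.0))   # 0.0   -> random-guess performance
print(normalize_score(75.0, 50.0))   # 50.0  -> halfway between chance and perfect
print(normalize_score(100.0, 50.0))  # 100.0 -> perfect score
```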
Improved reproducibility and interface
Hugging Face has updated the evaluation suite in collaboration with EleutherAI to improve reproducibility. Updates include support for delta weights (LoRA fine-tuning/adaptation), a new logging system integrated with the leaderboard, and the use of chat templates during evaluation. Additionally, manual checks were performed on all implementations to ensure consistency and accuracy. The interface has also been significantly improved: thanks to the Gradio team, particularly Freddy Boulton, the new leaderboard component loads data on the client side, making searches and column selection nearly instantaneous. This provides users with a faster and smoother experience.
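To make the chat-template point concrete, the snippet below sketches how a benchmark prompt can be wrapped in a model's own chat template using the transformers API; the model name and prompt are placeholders chosen for illustration only.

```python
# A minimal sketch of applying a model's chat template to an evaluation prompt
# (the model name is only an example; any chat model with a template works).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "Answer the following question.\n\nQuestion: ..."},
]

# The templated string, not the raw prompt, is what gets fed to the model.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```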
Prioritizing community-relevant models
The new ranking features a “maintainer’s choice” category that highlights high-quality models from a variety of sources, including large companies, startups, collectives, and individual contributors. This curated list aims to include state-of-the-art LLMs and prioritize evaluations of the most useful models for the community.
Voting on model relevance
A voting system has been implemented to manage the high volume of model submissions. Community members can vote for their preferred models and those with the most votes will be prioritized for evaluation. This system ensures that the most anticipated models are evaluated first, reflecting the interests of the community.
In conclusion, Hugging Face’s Open LLM Leaderboard v2 represents a major milestone in language model evaluation. With its more challenging benchmarks, fairer scoring system, and improved reproducibility, it aims to push the boundaries of model development and provide more reliable insights into model capabilities. The Hugging Face team is optimistic about the future and looks forward to continued innovation and improvement as more models are evaluated on this new, more rigorous leaderboard.
Review the Leaderboard and details. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.