The open-source community has released a number of large language models fine-tuned to follow instructions, and these models can give helpful responses to users' questions and commands. Notable examples include the LLaMA-based Alpaca and Vicuna and the Pythia-based OpenAssistant and Dolly.
Despite new models being released every week, the community still struggles to benchmark them properly. Because the questions posed to LLM assistants are often open-ended, it is difficult to build an automated system that can assess the quality of their responses, and human evaluation through pairwise comparison is often required. Ideally, a benchmarking system built on pairwise comparison should be scalable, incremental, and able to produce a unique ordering of models.
Few of today's LLM benchmarking systems meet all of these requirements. Established frameworks such as HELM and lm-evaluation-harness provide multi-metric measurements on standard academic tasks, but they do not evaluate free-form questions well because they are not based on pairwise comparison.
LMSYS ORG is an organization that builds large models and systems that are open, scalable, and accessible. Its latest work is Chatbot Arena, a crowdsourced LLM benchmark platform based on anonymous, randomized battles. As in chess and other competitive games, Chatbot Arena uses the Elo rating system, which shows promise in delivering the desirable properties described above.
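To make the rating mechanism concrete, below is a minimal sketch of a standard Elo update for a single pairwise battle. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not the exact constants used by Chatbot Arena.

```python
# Minimal sketch of a standard Elo update for one pairwise "battle".
# K-factor and starting ratings are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one battle.
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

Because each vote only adjusts the two ratings involved, the leaderboard can be updated incrementally as new battles arrive, which is what makes this approach scalable for a crowdsourced benchmark.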
The team began collecting data a week ago when they opened the arena with several well-known open-source LLMs. The crowdsourced data-collection method reflects real-world use of LLMs: a user chats with two anonymous models side by side in the arena and compares their responses.
The arena is hosted at https://arena.lmsys.org by FastChat, the multi-model serving system. A person entering the arena chats with two unnamed models side by side. After receiving responses from both models, the user can continue the conversation or vote for the one they prefer; once a vote is cast, the models' identities are revealed. The user can then keep chatting with the same two models or start a new battle with two new anonymous models. The system logs all user activity, but only votes cast while the model names remain hidden are used in the analysis. Roughly 7,000 valid anonymous votes have been collected since the arena went live a week ago.
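For illustration, here is a hypothetical sketch of how logged battles could be turned into an Elo leaderboard while discarding votes cast after the model names were revealed. The record fields (model_a, model_b, winner, anonymous) and the sample model names are assumptions made for the example, not Chatbot Arena's actual log schema.

```python
# Hypothetical sketch: aggregating logged battles into an Elo leaderboard.
# The record format and sample data below are assumptions for illustration.
from collections import defaultdict

def elo_update(ra: float, rb: float, score_a: float, k: float = 32) -> tuple[float, float]:
    exp_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
    return ra + k * (score_a - exp_a), rb + k * ((1.0 - score_a) - (1.0 - exp_a))

battles = [
    {"model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_a", "anonymous": True},
    {"model_a": "dolly-v2-12b", "model_b": "oasst-pythia-12b", "winner": "tie", "anonymous": True},
    {"model_a": "vicuna-13b", "model_b": "dolly-v2-12b", "winner": "model_a", "anonymous": False},
]

ratings: dict[str, float] = defaultdict(lambda: 1000.0)

for b in battles:
    # Only votes cast before the model names were revealed are counted.
    if not b["anonymous"]:
        continue
    score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[b["winner"]]
    ratings[b["model_a"]], ratings[b["model_b"]] = elo_update(
        ratings[b["model_a"]], ratings[b["model_b"]], score_a
    )

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```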
In the future, the team plans to implement better sampling algorithms, tournament mechanisms, and serving systems to support a larger variety of models and to provide fine-grained rankings for different tasks.
Check out the Paper, Code, and Project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.