Recent advances in the development of LLMs have popularized their use for various NLP tasks that were previously addressed using older machine learning methods. Large language models are capable of solving a variety of linguistic problems, such as classification, summarization, information retrieval, content creation, answering questions, and maintaining a conversation, all using a single model. But how do we know they are doing a good job on all of these different tasks?
The rise of LLMs has brought an unsolved problem to light: we do not have a reliable standard for evaluating them. What makes evaluation difficult is that LLMs are used for very diverse tasks, and we lack a clear definition of what a good answer is for each use case.
This article reviews current approaches to evaluating LLMs and introduces a new LLM leaderboard that leverages human evaluation and improves on existing evaluation techniques.
The usual first approach to evaluation is to run the model on several selected datasets and examine its performance. Hugging Face created the Open LLM Leaderboard, where large open-access models are evaluated on four well-known datasets (AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA). This is automatic evaluation, and it checks the model's ability to recall facts and answer specific questions.
Here is an example question from the MMLU dataset:
Subject: college_medicine
Question: An expected side effect of creatine supplementation is:
- A) muscle weakness
- B) increase in body mass
- C) muscle cramps
- D) electrolyte loss
Answer: (B)
Scoring the model on these types of questions is a useful metric and works well for verifying facts, but it does not test the model's generative capabilities. This is probably the biggest disadvantage of this assessment method, because generating free text is one of the most important features of LLMs.
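To make the mechanics concrete, here is a minimal sketch of how accuracy is computed on an MMLU-style multiple-choice benchmark. The `ask_model` function is a hypothetical placeholder for a real model call, not part of any actual evaluation library.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).

def ask_model(question: str, options: dict) -> str:
    # Placeholder: replace with a real API or local model call that
    # returns a single option letter ("A", "B", "C", or "D").
    return "B"

def accuracy(items: list) -> float:
    """Each item holds 'question', 'options' (letter -> text) and 'answer' (gold letter)."""
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"]).strip().upper()
        correct += prediction == item["answer"]
    return correct / len(items)

items = [
    {
        "question": "An expected side effect of creatine supplementation is:",
        "options": {
            "A": "muscle weakness",
            "B": "increase in body mass",
            "C": "muscle cramps",
            "D": "electrolyte loss",
        },
        "answer": "B",
    },
]
print(f"Accuracy: {accuracy(items):.0%}")  # Accuracy: 100%
```

The score is simply the fraction of questions where the model picks the gold option, which is why this style of evaluation says little about free-text generation quality.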
There seems to be a consensus in the community that to properly evaluate a model we need human evaluation. Typically, this is done by comparing the responses of different models.
Comparison of two model completions for a prompt in the LMSYS project – author's screenshot
Scorers decide which response is better, as seen in the example above, and sometimes quantify the difference in quality between the responses. LMSYS Org has created a leaderboard that uses this type of human evaluation to compare 17 different models, reporting an Elo rating for each model.
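The Elo system turns a stream of pairwise "which response is better" votes into a single rating per model. The sketch below shows the standard Elo update rule; the exact parameters and aggregation used by the LMSYS leaderboard may differ.

```python
# Standard Elo update from one pairwise comparison between model A and model B.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings after one vote; a tie would use a score of 0.5."""
    score_a = 1.0 if a_won else 0.0
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Both models start at 1000; model A wins one comparison.
print(update_elo(1000, 1000, a_won=True))  # (1016.0, 984.0)
```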
Because human evaluation is difficult to scale, efforts have been made to speed up the evaluation process, which resulted in an interesting project called AlpacaEval. Here, each model is compared to a baseline (text-davinci-003), and human evaluation is replaced with judgments from GPT-4. This is indeed fast and scalable, but can we rely on a model to do the scoring? We need to be aware of model biases; in fact, the project has shown that GPT-4 tends to favor longer responses.
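The core loop of this style of LLM-as-judge evaluation is easy to outline. The sketch below is illustrative only: `llm_judge` is a hypothetical stand-in for a real API call, and AlpacaEval's actual prompt templates and scoring logic differ.

```python
# Rough sketch of LLM-as-judge evaluation: a judge model picks the better of
# two responses per prompt, and the score is the win rate against a baseline.
import random

def llm_judge(prompt: str, response_a: str, response_b: str) -> str:
    # Placeholder: replace with a call to your judge model; must return "A" or "B".
    return "A"

def win_rate(prompts, candidate_outputs, baseline_outputs) -> float:
    wins = 0
    for prompt, cand, base in zip(prompts, candidate_outputs, baseline_outputs):
        # Randomize the presentation order to reduce position bias in the judge.
        if random.random() < 0.5:
            wins += llm_judge(prompt, cand, base) == "A"
        else:
            wins += llm_judge(prompt, base, cand) == "B"
    return wins / len(prompts)
```

Even with tricks like randomizing the order of the two responses, biases such as the preference for longer answers remain a concern.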
LLM evaluation methods continue to evolve as the AI community seeks easy, fair, and scalable approaches. The latest development comes from the Toloka team, whose new leaderboard aims to advance current evaluation standards.
The new leaderboard compares model responses to prompts from real-world users, classified by useful NLP tasks as described in the InstructGPT paper. It also shows the overall win rate of each model across all categories.
Toloka Leaderboard – author's screenshot
The evaluation used in this project is similar to the one carried out in AlpacaEval. The scores on the leaderboard represent the win rate of each model against the Guanaco 13B model, which serves as the baseline. The choice of Guanaco 13B is an improvement over the AlpacaEval setup, which relies on the soon-to-be-deprecated text-davinci-003 model as its baseline.
The actual evaluation is performed by expert human annotators on a set of real-world prompts. For each prompt, annotators are shown two responses and asked which one they prefer. You can find details about the methodology on the Toloka blog.
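As a rough illustration of how such pairwise human judgments can be turned into the per-category win rates shown on the leaderboard, here is a small sketch. The field names and the majority-vote aggregation are assumptions for the example, not Toloka's published pipeline.

```python
# Illustrative aggregation of pairwise human judgments into win rates per category.
from collections import defaultdict

def win_rates_by_category(judgments):
    """judgments: list of dicts with 'category' and 'votes', where each vote is
    True if an annotator preferred the candidate model over the baseline
    (e.g. Guanaco 13B) on that prompt."""
    per_prompt_wins = defaultdict(list)
    for item in judgments:
        votes = item["votes"]
        # Majority vote across annotators decides the winner for this prompt.
        candidate_won = sum(votes) > len(votes) / 2
        per_prompt_wins[item["category"]].append(candidate_won)
    return {
        category: sum(wins) / len(wins)
        for category, wins in per_prompt_wins.items()
    }

example = [
    {"category": "brainstorming", "votes": [True, True, False]},
    {"category": "brainstorming", "votes": [False, False, True]},
    {"category": "summarization", "votes": [True, True, True]},
]
print(win_rates_by_category(example))
# {'brainstorming': 0.5, 'summarization': 1.0}
```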
This type of human evaluation is more informative than automatic evaluation methods and improves on the human evaluation used for the LMSYS leaderboard. The disadvantage of the LMSYS approach is that anyone with the link can participate in the evaluation, which raises serious questions about the quality of the data collected this way. A closed pool of expert annotators has greater potential for reliable results, and Toloka applies additional quality-control techniques to ensure data quality.
In this article, we presented a promising new solution for evaluating LLMs: the Toloka leaderboard. The approach is innovative, combining the strengths of existing methods, adding task-specific granularity, and relying on trusted human annotation techniques to compare the models.
Explore the leaderboard and share your opinions and suggestions for improvement with us.
Magdalena Konkiewicz is a Data Evangelist at Toloka, a global company supporting fast and scalable AI development. She holds a master's degree in Artificial Intelligence from the University of Edinburgh and has worked as an NLP engineer, developer, and data scientist for companies in Europe and America. She has also been involved in teaching and mentoring data scientists and regularly contributes to publications on data science and machine learning.