Benchmarks are often hailed as a hallmark of success. They're a popular way to measure progress, whether it's running a sub-4-minute mile or excelling on standardized tests. In the context of artificial intelligence (AI), benchmarks are the most common method for evaluating a model's capabilities. Industry leaders like OpenAI, Anthropic, Meta, and Google compete in a race to outdo each other with superior benchmark scores. However, recent research and industry complaints are casting doubt on whether common benchmarks truly capture the essence of a model's capability.
Emerging research suggests that some models' training sets have been contaminated with the very data they are tested on, raising questions about whether their benchmark scores reflect true understanding. It's like actors who play doctors or scientists in movies: they deliver the lines convincingly without really understanding the underlying concepts. When Cillian Murphy played the famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he probably didn't understand the complex physical theories his character was talking about. If a model has, like an actor, simply memorized the benchmark, does the benchmark really evaluate its capabilities?
Recent findings from the University of Arizona indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, calling its scores on the associated benchmarks into question (1). Additionally, researchers from the University of Science and Technology of China found that when they applied their "probing" techniques to the popular MMLU benchmark (2), the results decreased dramatically.
Their probing techniques consisted of a series of methods intended to challenge the model's understanding of a question when it is posed in different ways, with different answer options but the same correct answer. The probing techniques included paraphrasing questions, paraphrasing options, permuting options, adding extra context to questions, and adding a new option to benchmark questions.
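To make this concrete, here is a minimal Python sketch of two such perturbations, permuting the answer options and adding a new option. The question format and function names are illustrative assumptions, not the researchers' actual implementation.

```python
import random


def permute_options(question: str, options: list[str], correct_idx: int, seed: int = 0):
    """Shuffle the answer options while tracking where the correct answer moves.

    A robust model should pick the same answer regardless of option order; a drop
    in accuracy under permutation hints at memorization of the original layout.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_correct_idx = order.index(correct_idx)
    return question, shuffled, new_correct_idx


def add_option(question: str, options: list[str], correct_idx: int, new_option: str):
    """Append an extra (incorrect) option, another simple probe of robustness."""
    return question, options + [new_option], correct_idx


# Example usage with a made-up MMLU-style item
q = "Which planet has the largest mass in the Solar System?"
opts = ["Earth", "Jupiter", "Saturn", "Mars"]
q, shuffled_opts, correct = permute_options(q, opts, correct_idx=1)
q, more_opts, correct = add_option(q, shuffled_opts, correct, "Neptune")
print(more_opts, "correct ->", more_opts[correct])
```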
The graph below shows that although every model tested performed well on the unaltered "vanilla" MMLU benchmark, none performed as strongly once the probing techniques were applied to the different sections of the benchmark (LU, PS, DK, All).
This evolving situation prompts a reevaluation of how AI models are evaluated. The need for benchmarks that reliably demonstrate capabilities and anticipate data contamination and memorization issues is becoming evident.
As models continue to evolve and are updated, potentially folding benchmark data into their training sets, benchmarks will have an inherently short lifespan. Additionally, model context windows are growing rapidly, allowing far more context to be supplied with each request. The larger the context window, the greater the potential for contaminated data to indirectly bias the model toward the test examples it has already seen.
To address these challenges, innovative approaches are emerging, such as dynamic benchmarks, which employ tactics like altering questions, complicating questions, introducing noise into questions, paraphrasing questions, reversing question polarity, and more (3).
The example below illustrates several methods for modifying benchmark questions (either manually or with a generative language model).
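As a rough illustration, here is a hypothetical sketch of how such modifications could be driven by prompts to a generative language model. The prompt wording and the `generate` stub are assumptions for the sake of the example, not part of the cited work.

```python
# Hedged sketch of LLM-driven question rewriting for a dynamic benchmark.
# `generate` is a placeholder stub, not a real API: swap it for a call to
# whichever model you actually use.

REWRITE_PROMPTS = {
    "paraphrase": "Rewrite this question so it asks the same thing in different words:\n{q}",
    "add_noise": "Insert an irrelevant but plausible sentence before this question:\n{q}",
    "reverse_polarity": "Negate this question so the previously incorrect options become correct:\n{q}",
    "complicate": "Rewrite this question so answering it requires one extra reasoning step:\n{q}",
}


def generate(prompt: str) -> str:
    """Placeholder for a generative language model call (an assumption, not a real API)."""
    return f"[model output for prompt: {prompt}]"


def make_dynamic_variants(question: str) -> dict[str, str]:
    """Produce several rewritten variants of one benchmark question."""
    return {name: generate(template.format(q=question)) for name, template in REWRITE_PROMPTS.items()}


# Example usage with a made-up question
variants = make_dynamic_variants("Which gas makes up most of Earth's atmosphere?")
for name, text in variants.items():
    print(name, "->", text)
```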
As we move forward, the need to align evaluation methods more closely with real-world applications becomes evident. Establishing benchmarks that accurately reflect practical tasks and challenges will not only provide a truer measure of AI capabilities, but will also guide the development of small language models (SLMs) and AI agents. These specialized models and agents require benchmarks that truly capture their potential to perform practical, useful tasks.