Large language models (LLMs) are becoming increasingly complex, which makes them difficult to evaluate. The community has produced many benchmarks in a relatively short time, but benchmark scores do not always correspond to real-world performance. Some evidence suggests that data from many popular benchmarks may have leaked into the datasets used for pre-training and fine-tuning.
Despite widespread agreement that this is an important issue, pinpointing the source of contamination has proven difficult. The two most widely used detection techniques are n-gram overlap and embedding similarity search. N-gram overlap, a form of string matching, is used for contamination detection by state-of-the-art models such as GPT-4, PaLM, and Llama, but its precision is limited. Embedding similarity search instead compares embeddings from a pre-trained model (such as BERT) to find related, and possibly contaminated, examples; here, however, it is hard to choose a similarity threshold that balances recall and precision. In addition, there is a growing trend of training models on synthetic data generated by LLMs (e.g., GPT-4), where contamination can be even harder to detect via string matching.
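To make the n-gram approach concrete, below is a minimal Python sketch of overlap-based contamination detection. The whitespace tokenizer, the n-gram size, and the threshold are illustrative assumptions, not the exact settings used in the GPT-4, PaLM, or Llama reports.

```python
# Minimal sketch of n-gram overlap contamination detection.
# The whitespace tokenizer, n=13, and the 0.5 threshold are illustrative
# assumptions; real model reports use their own tokenizers and cutoffs.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample: str, test_sample: str,
                    n: int = 13, threshold: float = 0.5) -> bool:
    test_grams = ngrams(test_sample, n)
    if not test_grams:
        return False
    overlap = len(test_grams & ngrams(train_sample, n)) / len(test_grams)
    return overlap >= threshold
```

Because a paraphrase or translation shares almost no long n-grams with the original text, a check like this lets rephrased content pass straight through.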
To examine existing decontamination methods, a new study by UC Berkeley and Shanghai Jiao Tong University introduces the concept of the “rephrased sample”: a sample that has the same semantics as the original test sample but is difficult to flag with existing contamination tests. LLMs generate rephrased samples by paraphrasing test samples or translating them into another language. The researchers show that if such rephrased samples are used for training, the resulting model overfits badly and can achieve extremely high benchmark scores: a fine-tuned Llama-13B model can match the performance of GPT-4 in all benchmarks while remaining undetected as contamination by n-gram overlap. They observe this behavior on widely used benchmarks such as MMLU, GSM-8K, and HumanEval. As a result, the ability to identify rephrased samples is crucial.
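As a rough illustration of the idea, the snippet below shows how a benchmark test case might be rephrased by an LLM so that its meaning is preserved while the surface wording changes. The prompt text, model name, and use of the OpenAI client are assumptions for illustration, not the authors' exact pipeline.

```python
# Rough illustration of producing a "rephrased sample": a benchmark test
# case is paraphrased (or translated) so that its semantics are preserved
# while shared n-grams with the original are destroyed. The prompt wording
# and model name are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase(test_sample: str) -> str:
    prompt = (
        "Rewrite the following problem so that its meaning is unchanged "
        "but the wording shares as little text as possible with the "
        "original:\n\n" + test_sample
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```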
The researchers explain the shortcomings of conventional decontamination techniques and propose a novel LLM-based approach. They first apply an embedding similarity search to retrieve the training samples most similar to a given test sample, and then use an LLM to judge whether any of the top-k retrieved samples is too similar to that test instance. The results demonstrate the superiority of the proposed LLM decontaminator over conventional techniques. They apply the decontaminator to a variety of popular datasets used for fine-tuning and pre-training. CodeAlpaca, a dataset synthesized with GPT-3.5, was also found to contain a considerable number of rephrased HumanEval samples (12.8%, to be exact), pointing to possible contamination when training on synthetic data generated by LLMs.
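The following is a minimal sketch of that two-stage idea, combining an embedding similarity search with an LLM judge. The encoder, the judging prompt, the model names, and the value of k are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the two-stage "LLM decontaminator" idea described above:
# (1) an embedding similarity search retrieves the top-k training samples
#     closest to a test case,
# (2) a strong LLM judges whether any retrieved sample is a rephrasing of it.
# Model names, the prompt, and k are illustrative assumptions.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

def is_rephrased(test_case: str, train_case: str) -> bool:
    prompt = (
        "Do these two problems ask essentially the same thing, i.e. is one "
        "a paraphrase or translation of the other? Answer yes or no.\n\n"
        f"A: {test_case}\n\nB: {train_case}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def detect_contamination(test_case: str, train_set: list[str], k: int = 5) -> bool:
    # Cosine similarity between the test case and every training sample.
    scores = util.cos_sim(encoder.encode(test_case), encoder.encode(train_set))[0]
    top_k = scores.argsort(descending=True)[:k].tolist()
    return any(is_rephrased(test_case, train_set[i]) for i in top_k)
```

Using an LLM as the final judge is what lets such a detector catch paraphrases and translations that neither n-gram overlap nor a fixed embedding-similarity threshold reliably flags.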
The researchers advise the community to adopt more thorough decontamination procedures whenever LLMs are evaluated on public benchmarks. They also hope that new one-time tests, such as those drawn from Codeforces and Kaggle competitions, can be created so that LLMs can be assessed fairly despite these fundamental problems.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements that make life easier for everyone in today's evolving world.