When it comes to artificial intelligence, appearances can be deceptive. The mystery surrounding the inner workings of large language models (LLMs) stems from their enormous size, complex training methods, hard-to-predict behavior, and the difficulty of interpreting them.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently took a closer look at how LLMs perform on various tasks, revealing interesting insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.
The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations that deviate from those default conditions and that models like GPT-4 and Claude would usually be expected to handle. The researchers devised tests outside the models’ comfort zones by modifying existing tasks rather than creating entirely new ones. They drew on a variety of datasets and benchmarks targeted at different aspects of the models’ capabilities, such as arithmetic, chess, evaluating code, answering logical questions, and so on.
When users interact with language models, arithmetic is usually done in base 10, the number base the models are most familiar with. But watching them perform well in base 10 might give the false impression that they are highly proficient at addition. Logically, if they really possessed generalizable addition skills, one would expect high and reliable performance across all number bases, similar to calculators or computers. In fact, the research showed that these models are not as robust as many initially assume. Their high performance is limited to common task variants, and they suffer a consistent and severe performance drop in unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
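To make this concrete, here is a minimal sketch of how such a counterfactual addition probe could be scored. It is an illustration of the idea rather than the paper’s actual evaluation harness, and the `ask_model` function is a hypothetical stand-in for whatever LLM API is being evaluated.

```python
# Minimal sketch (not the study's real harness): pose the same addition
# problems in the default base 10 and in an unfamiliar base such as base 9,
# then compare accuracy. `ask_model` is a hypothetical model-query function.
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2-10)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_prompt(a: int, b: int, base: int) -> str:
    return (f"You are doing addition in base {base}. "
            f"What is {to_base(a, base)} + {to_base(b, base)}? "
            f"Answer only with the result in base {base}.")

def score(ask_model, base: int, trials: int = 100) -> float:
    """Fraction of random addition problems answered correctly in this base."""
    correct = 0
    for _ in range(trials):
        a, b = random.randint(10, 99), random.randint(10, 99)
        expected = to_base(a + b, base)
        if ask_model(make_prompt(a, b, base)).strip() == expected:
            correct += 1
    return correct / trials

# Comparing score(ask_model, 10) against score(ask_model, 9) helps separate
# memorized base-10 behavior from a generalizable notion of addition.
```

A large gap between the two scores is the kind of signal the study describes: strong performance on the familiar variant, and a sharp drop once the surface form changes.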
The pattern held across many other tasks, such as musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players would still be expected to be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and performed no better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on standard tasks is likely due not to general task skills, but to overfitting or straight-up memorization of what they've seen in their training data.
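For the chess case, the ground truth for such counterfactual legality questions is easy to compute. The sketch below is one possible illustration, assuming the python-chess library and an invented altered starting position in which knights and bishops have swapped squares; it does not reproduce the paper’s exact positions or prompts.

```python
# Toy illustration (not the study's actual pipeline): check the legality of
# the same candidate move from the standard opening position and from a
# slightly perturbed one where knights and bishops have swapped squares.
import chess

DEFAULT_FEN = chess.STARTING_FEN
# Hypothetical counterfactual start: knights and bishops swapped on both back ranks.
ALTERED_FEN = "rbnqknbr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1"

def is_legal(fen: str, uci_move: str) -> bool:
    """Ground-truth legality of a move in the given position."""
    board = chess.Board(fen)
    return chess.Move.from_uci(uci_move) in board.legal_moves

# g1f3 is legal for the knight in the default game, but illegal for the
# bishop that now sits on g1 in the altered one. A model that has merely
# memorized standard openings tends to miss this distinction.
print(is_legal(DEFAULT_FEN, "g1f3"))   # True
print(is_legal(ALTERED_FEN, "g1f3"))   # False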
“We’ve discovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-trodden path, but struggle when the terrain becomes unfamiliar. This discovery is crucial as we strive to improve the adaptability of these models and expand their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, affiliated with CSAIL, and lead author of a new paper on the research. “As AI becomes increasingly ubiquitous in our society, it must reliably handle a variety of scenarios, whether familiar or unfamiliar. We hope that these insights will one day inform the design of future LLMs with greater robustness.”
Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings did not capture the full range of challenges that models could potentially face in real-world applications, indicating the need for more diverse test environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses, which could mean analyzing more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better understand the logic behind models’ decision-making processes.
“As language models scale, understanding their training data becomes increasingly difficult, even for open models, and even more so for proprietary ones,” says Hao Peng, an assistant professor at the University of Illinois at Urbana-Champaign. “The community remains puzzled about whether these models actually generalize to unseen tasks or seemingly succeed by memorizing the training data. This paper takes important steps toward addressing this question. It constructs a set of carefully designed counterfactual evaluations, which provide new insights into the capabilities of state-of-the-art language models. It reveals that their ability to solve unseen tasks is perhaps much more limited than many anticipated. It has the potential to inspire future research to identify the failure modes of current models and develop better ones.”
Additional authors include Najoung Kim, an assistant professor at Boston University and a visiting researcher at Google, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.
The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.