Pretrained Large Language Models (LLMs) are fast becoming the dominant paradigm for a wide range of language tasks, including the generation and completion of computer code. LLMs have shown improved performance with increasing model size on many real-world tasks, including programming tasks. More recently, however, researchers have discovered several tasks that exhibit inverse scaling, where output quality decreases rather than improves with model size. Inverse scaling tasks identified so far typically involve social biases, where larger models (perhaps correctly) pick up unwanted biases from skewed training sets, or extremely rare but still recognizable patterns of language use.
These extreme tasks do not necessarily point to failure modes that matter for practical applications, because they tend to be highly artificial and may involve unusual pragmatics or require reasoning over counterfactual information. In this paper, researchers from the University of Edinburgh and Heriot-Watt University present a new type of inverse scaling task: generating Python code in the presence of redefined default identifiers. This has immediate practical ramifications (default identifier redefinition is a metaprogramming technique used in well-known libraries, as illustrated in the sketch below) and broader scientific ramifications, because it shows that LLMs are limited in their ability to reason about the complex and abstract semantic structure of programming languages, and that increasing model size does not alleviate these problems but may even make them worse.
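To make the idea concrete, here is a minimal sketch (not taken from the paper) of what redefining a default identifier looks like in Python; shadowing a builtin name such as `print` is a legitimate, if unusual, metaprogramming pattern:

```python
# Capture the builtin before shadowing it.
_original_print = print

def print(*args, **kwargs):
    # Redefined `print`: every call is now prefixed with a tag.
    _original_print("[log]", *args, **kwargs)

print("hello")  # prints: [log] hello
```

After such a redefinition, the "obvious" behaviour of the identifier no longer applies, which is exactly the situation the paper's task probes.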
Programming languages are particularly well suited to automated parsing and program generation thanks to their precise and well-defined syntax and semantics. They are also scientifically interesting because, unlike many NLP tasks, which are too ambiguous for high-quality examples to be produced automatically, coding problems can be generated automatically and checked against an objective ground truth. In addition, this line of research matters for software engineering tools built on LLMs, such as GitHub Copilot, which are seeing widespread adoption by developers.
The researchers investigated the ability of large language models to predict correct continuations of fragments of Python programs in cases where the correct continuations are statistically rare, because default identifiers have been redefined by statements placed earlier in the prompt. Not only do all the models examined perform poorly on this task, but several model families exhibit inverse scaling: as model size increases, they get worse rather than better. These findings suggest that LLMs rely on “shortcut learning”, i.e., weak, unstable, and largely lexical correlations in the data, rather than a thorough understanding of its semantics (in this case, of Python code). These findings are important for improving the scientific understanding of LLM capabilities and their applicability as a foundational technology for automated code generation tools. Future research could examine the effects of scaling in other programming languages and with larger model sizes.
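A hypothetical example in the spirit of the task described above (the exact prompts come from the paper and its GitHub repository) shows why the correct continuation becomes statistically rare once a default identifier has been redefined:

```python
# `len` is redefined earlier in the fragment to sum its argument.
len = lambda xs: sum(xs)

print(len([2, 4, 8]))
# A model relying on lexical shortcuts tends to predict 3 (the usual length),
# but the correct output after the redefinition is 14.
```

Predicting the continuation correctly requires tracking the redefinition rather than falling back on the pattern that dominates the training data.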
Check out the Paper and GitHub link for more details.