Code generation using large language models (LLMs) has become a critical research area, but generating accurate code for complex problems in a single attempt remains a major challenge. Even experienced human developers often require multiple iterations of trial-and-error debugging to solve difficult programming problems. While LLMs have demonstrated impressive code generation capabilities, their self-debugging ability to analyze faulty code and make the necessary fixes is still limited. This limitation is evident in open-source models such as StarCoder and CodeLlama, which show significantly lower self-refinement performance compared to models such as GPT-3.5-Turbo.
Existing approaches to improving code generation and debugging capabilities in LLMs have followed several different paths. LLMs have demonstrated significant success in various code-related tasks, including code generation, bug fixing, program testing, and fuzzing. These models use extensive pre-training on vast datasets to understand patterns and generate contextually relevant code. However, most existing work has focused primarily on single-round generation rather than iterative improvement. Methods such as ILF, CYCLE, and Self-Edit have explored supervised fine-tuning approaches, while solutions such as OpenCodeInterpreter and EURUS have attempted to create high-quality multi-turn interaction datasets using advanced models for fine-tuning purposes.
Researchers from Purdue University, AWS AI Labs, and the University of Virginia have proposed LEDEX (Learn to Self-Debug and Explain Code), a novel training framework designed to improve the self-debugging capabilities of LLMs. The framework is based on the observation that a sequential process of explaining faulty code and then refining it allows LLMs to better analyze and improve that code. LEDEX implements an automated pipeline to collect high-quality datasets for code explanation and refinement. Furthermore, it combines supervised fine-tuning (SFT) and reinforcement learning (RL), using both successful and unsuccessful trajectories with a specialized reward system that evaluates code explanation and refinement quality.
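The explain-then-refine idea described above can be sketched as a simple two-step prompting loop. This is a minimal illustration, not the paper's actual implementation: the prompt templates, the `generate` callable, and the helper name `explain_then_refine` are all hypothetical.

```python
# Hypothetical sketch of LEDEX-style sequential self-debugging:
# step 1 asks the model to explain why the code is wrong,
# step 2 asks it to refine the code conditioned on that explanation.

EXPLAIN_PROMPT = (
    "The following solution is incorrect. Explain what is wrong:\n{code}\n"
)
REFINE_PROMPT = (
    "Given the buggy code:\n{code}\n"
    "and this explanation of the bug:\n{explanation}\n"
    "Write a corrected solution."
)

def explain_then_refine(generate, buggy_code):
    """Two-step self-debugging: produce an explanation of the bug first,
    then generate a refinement conditioned on that explanation.

    `generate` is any callable that maps a prompt string to a completion
    string (e.g. a wrapper around an LLM API)."""
    explanation = generate(EXPLAIN_PROMPT.format(code=buggy_code))
    refined = generate(
        REFINE_PROMPT.format(code=buggy_code, explanation=explanation)
    )
    return explanation, refined
```

In this sketch the refinement prompt sees both the buggy code and the explanation, which is the key structural difference from single-shot repair.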
LEDEX employs a comprehensive architecture comprising multi-stage collection, verification, and training processes. The framework begins by collecting code explanation and refinement datasets through queries to pre-trained or instruction-tuned models. These responses undergo rigorous execution-based verification to filter out low-quality samples and retain only high-quality explanation and refinement data. The filtered dataset then serves as input for supervised fine-tuning, which significantly improves the model's capabilities in error explanation and code refinement. LEDEX uses programming problems from MBPP, APPS, and CodeContests as training data. To expand the dataset of incorrect solutions, the framework prompts pre-trained LLMs such as StarCoder and CodeLlama with 3-shot examples to generate 20 solutions per problem.
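The execution-based verification step can be illustrated with a small sketch: run each candidate refinement against the problem's unit tests and keep only the (explanation, refinement) pairs whose refinement passes. The function names and the use of `exec` with assert-style tests are assumptions for illustration, not the paper's actual harness.

```python
# Hypothetical sketch of execution-based verification: a refinement is
# kept only if it runs and passes every unit test of its problem.

def passes_tests(candidate_code, test_cases):
    """Execute candidate_code, then run its assert-style tests;
    return True only if everything completes without an error."""
    namespace = {}
    try:
        exec(candidate_code, namespace)      # define the solution
        for test in test_cases:
            exec(test, namespace)            # e.g. "assert add(2, 3) == 5"
        return True
    except Exception:
        return False

def filter_refinements(samples, test_cases):
    """Keep only (explanation, refinement) pairs whose refinement
    is verified by execution."""
    return [
        (explanation, code)
        for explanation, code in samples
        if passes_tests(code, test_cases)
    ]
```

A real pipeline would sandbox execution (timeouts, process isolation) rather than calling `exec` directly, but the filtering logic is the same.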
LEDEX is evaluated using three backbone models: StarCoder-15B, CodeLlama-7B, and CodeLlama-13B, with initial training data collected from GPT-3.5-Turbo. The SFT phase shows significant improvements, achieving gains of up to 15.92% on pass@1 and 9.30% on pass@10 across four benchmark datasets. The subsequent RL phase further improves performance, with additional gains of up to 3.54% on pass@1 and 2.55% on pass@10. Notably, the model-independent nature of LEDEX is shown through experiments with CodeLlama-7B, which achieves substantial improvements (8.25% on pass@1 and 2.14% on pass@10) even when trained on data collected from CodeLlama-34B or from itself, demonstrating its effectiveness independent of GPT-3.5-Turbo.
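The pass@1 and pass@10 numbers above use the standard pass@k metric: the probability that at least one of k sampled solutions passes all tests. Assuming the evaluation follows the widely used unbiased estimator (from the HumanEval/Codex evaluation methodology, not stated explicitly in this summary), it can be computed as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: given n generated samples of which
    c are correct, estimate the probability that at least one of k
    randomly drawn samples is correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 20 samples per problem (as in the data collection described above) and 5 correct ones, `pass_at_k(20, 5, 10)` estimates how often a 10-sample budget would solve the problem.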
In conclusion, the researchers presented LEDEX, a comprehensive and scalable framework that combines automated data collection, verification processes, SFT and RL with innovative reward designs to significantly improve the ability of LLMs to identify and fix code errors. The model-independent nature of the framework is evidenced by its successful implementation with GPT-3.5-Turbo and CodeLlama, while its rigorous data verification process ensures the quality of code explanations and improvements. Human evaluations further validate the effectiveness of the framework, confirming that LEDEX-trained models produce superior code explanations that effectively help developers understand and resolve code issues.
Check out the <a target="_blank" href="https://www.amazon.science/publications/training-llms-to-better-self-debug-and-explain-code" rel="noreferrer noopener">Paper</a>. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.