Program synthesis, the automatic generation of computer programs from an input specification, is a crucial problem in software engineering. Efficient program synthesis can not only boost the productivity of software engineers but also lower the barrier to writing code. Pretrained large language models (LLMs) have recently shown significant progress in program synthesis, yet despite extensive pre-training they still fail to consistently generate correct code.
For example, raw code scraped from the Internet and used in code pre-training datasets often contains many security flaws. Researchers argue that contemporary LLM pre-training setups are substantially to blame for these shortcomings. Incorporating natural language feedback at test time has been shown to significantly increase the pass rates of code generation models.
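To make that test-time mechanism concrete, here is a minimal sketch of prompting a code model to repair a program given natural language feedback. The checkpoint name, prompt template, and helper function below are illustrative assumptions, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-6B-mono"  # assumed CODEGEN-MONO 6.1B checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def refine_with_feedback(task: str, buggy_code: str, feedback: str) -> str:
    """Ask the model for a corrected program, given a failing attempt and feedback."""
    prompt = (
        f"# Task: {task}\n"
        f"# Incorrect solution:\n{buggy_code}\n"
        f"# Feedback: {feedback}\n"
        f"# Corrected solution:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the newly generated code
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```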
The researchers propose imitation learning from language feedback to train LLMs on natural language feedback. The algorithm extends the work of Scheurer et al., who studied learning from language feedback for text summarization: they improve a summarization model by fine-tuning the base model on improved summaries produced from the initial model's summaries and human-written feedback. The researchers' work advances Scheurer et al. in several ways, including:
- Formalizing the algorithm and making it universally applicable
- Demonstrating how the reward function can be adapted for code generation
- Presenting a proof of concept of ILF (Imitation Learning from Language Feedback) for code generation
ILF (Imitation Learning from Language Feedback) trains a separate model, πRefine, to use language feedback to fix incorrectly generated programs, with the goal of increasing the accuracy of programs produced by a baseline code generation model πθ. The researchers then fine-tune πθ on the πRefine-generated refinements that pass unit tests, yielding a final improved model πθ*. (Researchers refer to the fixed programs as refinements.) This process can be viewed as minimizing the expected KL divergence from the ground-truth target distribution, and it can be repeated iteratively to further improve the model.
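A rough sketch of one ILF round under these definitions is shown below. The helper callables (generate, refine, passes_unit_tests, fine_tune) are assumed to be supplied by the caller and are illustrative placeholders, not the authors' implementation.

```python
def ilf_round(pi_theta, pi_refine, tasks, human_feedback,
              generate, refine, passes_unit_tests, fine_tune):
    """One round of ILF for code generation (illustrative sketch).

    pi_theta:        base code-generation model
    pi_refine:       model that turns (task, buggy program, feedback) into a fix
    tasks:           programming tasks, each with unit tests
    human_feedback:  dict mapping task id -> natural language feedback
    generate, refine, passes_unit_tests, fine_tune: caller-supplied helpers
    """
    verified_refinements = []
    for task in tasks:
        program = generate(pi_theta, task)                 # initial attempt
        if passes_unit_tests(program, task):
            continue                                       # feedback only for failures
        feedback = human_feedback[task.id]
        refinement = refine(pi_refine, task, program, feedback)
        if passes_unit_tests(refinement, task):            # keep only verified fixes
            verified_refinements.append((task.prompt, refinement))
    # Fine-tuning on verified refinements approximately minimizes the expected
    # KL divergence from the ground-truth distribution over correct programs.
    return fine_tune(pi_theta, verified_refinements)
```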
Investigation and Findings
The researchers use the Mostly Basic Python Problems (MBPP) dataset to train and evaluate the models. MBPP contains 974 Python programming tasks designed for entry-level programmers.
Although MBPP has its own designated prompt/training/validation/test split, the researchers further divided it into the following splits:
• MBPPRefine: These tasks have IDs in the range 111-310 and are ones that CODEGEN-MONO 6.1B failed to solve correctly. This split is used to train πRefine.
• MBPPTrain: These tasks have IDs in the range 311-974 and are ones that CODEGEN-MONO 6.1B failed to solve correctly. This split is first used to evaluate the correctness of the refinements produced by πRefine; the correct refinements from it are then used to fine-tune the base model.
• MBPPTest: Researchers use these tasks, which have IDs between 11 and 110, to assess the final performance of πθ*. Unlike the other two splits, all tasks in this split are used, rather than only those for which CODEGEN-MONO 6.1B did not initially produce correct programs. This makes it easier to compare the performance of πθ and πθ* against their baselines. A code sketch of this re-split appears below.
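A minimal sketch of the re-split, assuming MBPP is loaded through the Hugging Face datasets library (a tooling assumption, not necessarily the authors' pipeline):

```python
from datasets import load_dataset

mbpp = load_dataset("mbpp")                      # 974 Python tasks in total
all_tasks = [ex for split in mbpp.values() for ex in split]

def in_id_range(task, lo, hi):
    return lo <= task["task_id"] <= hi

mbpp_test   = [t for t in all_tasks if in_id_range(t, 11, 110)]    # evaluate the final model
mbpp_refine = [t for t in all_tasks if in_id_range(t, 111, 310)]   # train pi_Refine
mbpp_train  = [t for t in all_tasks if in_id_range(t, 311, 974)]   # collect/verify refinements

# MBPPRefine and MBPPTrain are further restricted to tasks that CODEGEN-MONO 6.1B
# initially fails; MBPPTest keeps every task in its ID range.
```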
To implement the algorithm, the researchers independently fine-tune two separate instances of CODEGEN-MONO 6.1B to produce πRefine and the final model πθ*. πRefine is trained on pairs of incorrect programs and human-written feedback, with human-written refinements as targets.
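As an illustration of how such training examples could be serialized for causal-LM fine-tuning, here is a hypothetical packing function; the prompt template and loss-masking convention are assumptions, not the paper's exact format.

```python
def build_refine_example(tokenizer, task_text: str, buggy_code: str,
                         feedback: str, refinement: str) -> dict:
    """Pack one (incorrect program, feedback, human refinement) triple into a
    causal-LM fine-tuning example, with the loss applied only to the target."""
    prompt = (
        f"# Task: {task_text}\n"
        f"# Incorrect solution:\n{buggy_code}\n"
        f"# Feedback: {feedback}\n"
        f"# Refined solution:\n"
    )
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(refinement + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + target_ids,
        # -100 tells the cross-entropy loss to ignore the prompt tokens
        "labels": [-100] * len(prompt_ids) + target_ids,
    }
```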
Although the ILF algorithm only requires collecting human-written feedback for tasks in MBPPTrain (assuming access to a πRefine that is already fine-tuned or can generate refinements via few-shot prompting), the researchers collect human-written feedback and refinements for all splits of the data to allow further analysis of the approach. This makes it possible, for example, to compare fine-tuning on the refinements generated by πRefine with fine-tuning on human-written refinements. Scaling ILF to other combinations of models and tasks requires additional feedback annotations; however, applying ILF on one dataset may still improve model performance on a different dataset for the same task. Scaling ILF across various tasks and models is left to future work.
Fine-tuning on a small sample of MBPP gold programs did not significantly improve accuracy over zero-shot inference. To test the hypothesis that MBPP's gold programs may be slightly out of distribution for CODEGEN-MONO 6.1B, the researchers computed the perplexity of the MBPP gold programs, the πRefine-generated refinements, and the human-written refinements under the pretrained CODEGEN-MONO 6.1B model. The MBPP dataset contains more high-perplexity programs (i.e., programs with perplexity of roughly 10² or higher) than either the πRefine-generated refinements or the human-written refinements, even though the distributions of the three data sources look broadly similar. Since the latter two datasets are closer to CODEGEN-MONO 6.1B's original distribution while still being functionally correct, they are probably easier for CODEGEN-MONO 6.1B to learn from.
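For reference, perplexity under a pretrained causal LM can be computed along these lines; the checkpoint name and the exp-of-mean-NLL definition are standard conventions assumed here, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Salesforce/codegen-6B-mono")
lm = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-6B-mono", device_map="auto")

@torch.no_grad()
def perplexity(program: str) -> float:
    """Perplexity = exp(mean negative log-likelihood) of the program's tokens."""
    enc = tok(program, return_tensors="pt").to(lm.device)
    # With labels equal to input_ids, the model returns the mean token-level NLL
    out = lm(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))
```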
ILF is especially useful when large amounts of gold code are not available. In this setting, ILF is a way of producing training data that explicitly corrects flaws in the original model while, at the same time, staying closer to the model's own outputs in representation space. So even though both training datasets contain the same number of functionally correct programs, fine-tuning the model on the πRefine-generated refinements does not require shifting the weights as much as fine-tuning on the MBPP gold programs would.
To sum up
Learning from human-written natural language feedback is more efficient in terms of training samples and more effective on coding tasks. An exciting recent finding is the ability of pretrained large language models (LLMs) to use natural language feedback at inference time. The researchers extend this finding by formalizing an algorithm, which they call Imitation Learning from Language Feedback (ILF), for learning from natural language feedback at training time. ILF is easy to use and sample-efficient, as it only needs a limited amount of human-written feedback during training and none at test time. The researchers also provide a proof of concept on a neural program synthesis task, demonstrating that ILF can be viewed as a way to minimize the KL divergence from the ground-truth distribution. Using ILF, they increase the pass rate of a CODEGEN-MONO 6.1B model by 38% relative (10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP gold programs and fine-tuning on human-written repaired programs. The researchers' findings indicate that training on demonstrations alone is inefficient for improving an LLM's performance on code generation tasks, and that learning from human-written natural language feedback is both more sample-efficient and more effective.
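As a reminder of what the pass-rate metric measures, here is a simplified sketch of evaluating generated programs against MBPP-style assert tests. Executing untrusted model output this way is unsafe outside a sandbox, and the function names are hypothetical.

```python
def passes(program: str, test_asserts: list[str]) -> bool:
    """Run a generated program plus its assert-style tests; True if nothing fails."""
    env: dict = {}
    try:
        exec(program, env)          # define the candidate function(s)
        for test in test_asserts:
            exec(test, env)         # each MBPP test is an `assert ...` statement
        return True
    except Exception:
        return False

def pass_rate(generations: dict[int, str], tests_by_task: dict[int, list[str]]) -> float:
    """Fraction of tasks whose generated program passes all of its tests."""
    passed = sum(passes(generations[tid], tests) for tid, tests in tests_by_task.items())
    return passed / len(tests_by_task)
```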
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Engineer with solid experience in FinTech companies covering the Finance, Cards & Payments, and Banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today's changing world, making everyone's life easier.