This article was accepted to the Natural Language Reasoning and Structured Explanations Workshop at ACL 2024.
Reinforcement learning from AI feedback (RLAIF) has shown significant potential in several domains, including mitigating harm in LLM outputs, improving text summarization, and mathematical reasoning. This paper presents an RLAIF framework for improving the code generation capabilities of lightweight LLMs (< 1B parameters). We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known hallucination problem in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model that better aligns the smaller LLM. We run our experiments on the Gorilla dataset and meticulously evaluate the quality of the generated code on several metrics, including AST matching, ROUGE, and CodeBLEU, and we develop a pipeline to accurately compute its executability rate. Our approach significantly improves on the fine-tuned LLM baseline, achieving a 4.5% gain in executability rate. Notably, a smaller LLM (780M parameters) trained with RLAIF outperforms a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.
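To make the reward-modeling step described above concrete, the sketch below shows a standard pairwise (Bradley-Terry) reward-model objective of the kind commonly used in RLAIF pipelines. It is an illustrative minimal sketch, not the paper's implementation: the model architecture, dimensions, and tensor shapes are assumptions chosen for brevity, and the AI labeler's preferences are stood in for by random token ids.

```python
# Minimal sketch of a pairwise reward-model objective for RLAIF.
# Architecture and dimensions are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (prompt, generated code) pair with a single scalar reward."""
    def __init__(self, vocab_size=32000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, h = self.encoder(x)           # final hidden state summarizes the sequence
        return self.score(h[-1]).squeeze(-1)

def preference_loss(model, chosen_ids, rejected_ids):
    """Bradley-Terry loss: the completion preferred by the AI labeler
    (e.g., a larger LLM such as GPT-3.5) should receive a higher reward."""
    r_chosen = model(chosen_ids)
    r_rejected = model(rejected_ids)
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

# Toy usage: random token ids stand in for tokenized API-call completions.
model = RewardModel()
chosen = torch.randint(0, 32000, (4, 64))    # completions the AI labeler preferred
rejected = torch.randint(0, 32000, (4, 64))  # completions it ranked lower
loss = preference_loss(model, chosen, rejected)
loss.backward()
```

Once trained, such a reward model would typically be used to score candidate completions during an RL fine-tuning loop (e.g., PPO) over the smaller LLM.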