Iterative preference optimization methods have proven effective on general instruction-tuning tasks, but have yielded only limited improvements on reasoning tasks. These methods use preference optimization to align the language model with human requirements better than supervised fine-tuning alone. Offline techniques such as DPO have gained popularity for their simplicity and efficiency, and recent work advocates applying such offline procedures iteratively, as in Iterative DPO, Self-Rewarding LLMs, and SPIN, which construct new preference relations to further improve model performance. Yet preference optimization remains largely unexplored for reasoning, even though other iterative training methods, such as STaR and ReST-EM, have been applied successfully to reasoning tasks.
Iterative alignment methods encompass both human-driven and automated strategies. Some rely on human feedback for reinforcement learning (RLHF), while others, such as Iterative DPO, optimize preference pairs autonomously, using the updated model to generate new pairs for subsequent iterations. SPIN, a variant of Iterative DPO, constructs preferences from human labels and model generations, but is limited once the model's generations match the quality of the human labels. Self-Rewarding LLMs also employ Iterative DPO, with the model itself acting as the reward evaluator, which yields gains in instruction following but only modest improvements in reasoning. In contrast, Expert Iteration and STaR focus on curating samples and refining the training data rather than on pairwise preference optimization.
Researchers at Meta FAIR and New York University present an approach to iterative preference optimization for reasoning tasks, specifically chain-of-thought (CoT) reasoning. Each iteration samples multiple CoT reasoning steps and final answers for every training input, then constructs preference pairs in which the winner has a correct final answer and the loser an incorrect one. Training uses a variant of DPO that adds a negative log-likelihood (NLL) term on the pair winners, which proves essential for improving performance. Each new iteration generates fresh pairs and retrains the model starting from the previous iteration's model, incrementally refining its reasoning.
Their approach assumes a base language model, typically pre-trained or instruction-tuned, a dataset of training inputs, and the ability to judge the correctness of final answers. Given a training input, the model generates (i) a sequence of reasoning steps (a chain of thought) and (ii) a final answer. Only the correctness of the final answer is evaluated; the correctness of the intermediate reasoning steps is not checked. The experiments use gold-labeled datasets for the training inputs, deriving a binary reward from an exact match between the label and the final answer. Each iteration then consists of two steps: (i) chain-of-thought and answer generation and (ii) preference optimization, as sketched below.
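The data-construction step can be pictured with a short Python sketch. This is a minimal illustration, not the authors' code: `sample_fn`, `extract_final_answer`, the "Final answer:" marker, and the sampling counts are hypothetical placeholders standing in for the paper's actual prompting setup and hyperparameters.

```python
import random
from typing import Callable, Dict, List

def extract_final_answer(completion: str) -> str:
    """Pull the final answer out of a sampled chain-of-thought completion.
    Assumes the answer follows a 'Final answer:' marker; the real prompt
    format used in the paper may differ."""
    marker = "Final answer:"
    return completion.split(marker)[-1].strip() if marker in completion else completion.strip()

def build_preference_pairs(
    prompts: List[str],
    gold_answers: List[str],
    sample_fn: Callable[[str, int], List[str]],  # samples N CoT completions for a prompt
    num_samples: int = 8,
    pairs_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """One iteration of data construction: sample multiple CoT completions
    per training input, score each by exact match against the gold label
    (binary reward), and pair a correct completion (winner) with an
    incorrect one (loser)."""
    pairs = []
    for prompt, gold in zip(prompts, gold_answers):
        completions = sample_fn(prompt, num_samples)
        correct = [c for c in completions if extract_final_answer(c) == gold]
        incorrect = [c for c in completions if extract_final_answer(c) != gold]
        if not correct or not incorrect:
            continue  # need at least one winner and one loser to form a pair
        for _ in range(pairs_per_prompt):
            pairs.append({
                "prompt": prompt,
                "chosen": random.choice(correct),     # correct CoT + answer
                "rejected": random.choice(incorrect), # incorrect CoT + answer
            })
    return pairs
```

The key point is that correctness is judged only by exact match on the final answer, so preference pairs can be built automatically, with no human annotation of the reasoning steps.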
In the experiments, the researchers trained the model with a modified DPO loss that includes an additional negative log-likelihood term, which they found essential. Reasoning ability improves with successive iterations of the method. Relying only on examples from the training set, the approach yields increasing accuracy for Llama-2-70B-Chat: from 55.6% to 81.6% on GSM8K (and to 88.7% with majority voting over 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge. These results outperform other Llama-2-based models that do not rely on additional datasets.
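To make the modified objective concrete, here is a minimal PyTorch sketch of a DPO loss augmented with an NLL term on the winning sequence. It assumes per-sequence log-probabilities under the current model and the reference (previous-iteration) model have already been computed; the function name, the length normalization, and the default `beta`/`alpha` values are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_nll_loss(
    policy_chosen_logps: torch.Tensor,   # sum log-prob of winner (CoT + answer) under current model
    policy_rejected_logps: torch.Tensor, # sum log-prob of loser under current model
    ref_chosen_logps: torch.Tensor,      # same quantities under the reference (previous-iteration) model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,        # token counts of the winner sequences
    beta: float = 0.1,                   # placeholder DPO temperature
    alpha: float = 1.0,                  # placeholder weight on the NLL term
) -> torch.Tensor:
    """Sketch of a DPO objective plus an extra negative log-likelihood term
    on the winning (correct) sequence, normalized by its length."""
    # Standard DPO term: preference margin between chosen and rejected
    # responses, measured as log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # NLL term on the winner: pushes up the absolute likelihood of the
    # correct chain of thought and answer, not just its relative preference.
    nll_term = -policy_chosen_logps / chosen_lengths

    return (dpo_term + alpha * nll_term).mean()
```

The extra NLL term is what distinguishes this objective from plain DPO: without it, the model can widen the preference margin while the likelihood of the correct reasoning drifts, which the authors identify as the reason the term is essential for reasoning performance.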
In conclusion, the study presents an iterative training algorithm, Iterative Reasoning Preference Optimization, aimed at improving LLM performance on chain-of-thought reasoning tasks. Each iteration generates multiple responses and constructs preference pairs based on the correctness of the final answer, then trains with a modified DPO loss that adds an NLL term. The method requires no human intervention or additional training data, keeping it simple and efficient. Experimental results show substantial improvements on GSM8K, MATH, and ARC-Challenge over several baselines that use the same base model and training data. These findings underscore the effectiveness of the iterative training approach in improving LLMs' reasoning abilities.
Check out the paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.