One of the most critical challenges for LLMs is aligning these models with human values and preferences, especially in their generated text. Generated output can be inaccurate, biased, or otherwise harmful (e.g., hallucinations), and this misalignment limits the use of LLMs in real-world domains such as education, healthcare, and customer service. The problem is compounded by the fact that bias can accumulate: iterative training processes tend to make alignment issues worse, so it is unclear whether the resulting output can be trusted. This remains a serious obstacle to deploying LLMs more broadly and effectively in real-world applications.
Current alignment methods include reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). RLHF trains a separate reward model from human feedback and uses it to fine-tune the LLM via reinforcement learning, while DPO optimizes the LLM directly on annotated preference pairs without a separate reward model. Both approaches rely heavily on massive amounts of human-labeled data, which is difficult to scale. Self-rewarding language models (SRLMs) attempt to reduce this dependency by generating preference data automatically, without human intervention. In SRLMs, a single model typically acts as both a policy model (generating responses) and a reward model that ranks those responses. While this has had some success, its main drawback is that the process inherently amplifies rewarding bias: the longer a model is trained on its own self-generated preference data, the more biased its reward signal becomes, which reduces the reliability of the preference data and degrades overall alignment performance.
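To make the DPO objective concrete, here is a minimal sketch in PyTorch; the function and variable names are illustrative, not taken from the paper or any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss, given summed log-probabilities of the chosen and
    rejected responses under the policy and a frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to assign a higher (reference-adjusted) likelihood
    # to the chosen response than to the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of two preference pairs (log-prob values are made up for illustration).
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -9.5]),
                torch.tensor([-12.0, -10.0]), torch.tensor([-13.5, -10.2]))
print(loss.item())
```

In a self-rewarding setup, the chosen/rejected labels in such pairs come from the model's own rankings rather than from human annotators, which is exactly where the bias that CREAM targets creeps in.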
In light of these shortcomings, researchers from the University of North Carolina, Nanyang Technological University, the National University of Singapore, and Microsoft introduced CREAM, which stands for Consistency Regularized Self-Rewarding Language Models. The approach alleviates bias amplification in self-rewarding models by adding a regularization term on the consistency of rewards across training iterations. The intuition is to compare the rewards the model produces in consecutive iterations and use their consistency to guide training. By contrasting the ranking of responses from the current iteration with the ranking from the previous iteration, CREAM identifies and focuses on reliable preference data, discouraging the model from overlearning from noisy or unreliable labels. This regularization mechanism reduces bias and lets the model learn more efficiently and effectively from its self-generated preference data, a notable improvement over existing self-rewarding methods.
CREAM operates within a generalized iterative preference-tuning framework applicable to both self-rewarding and RLHF-style methods. Consistency regularization works by comparing the rankings of responses produced by the model in consecutive iterations: the agreement between the current and previous iteration's rankings is measured with Kendall's tau coefficient. This consistency score is then incorporated into the loss function as a regularization term, encouraging the model to rely more on preference data that is consistent across iterations. CREAM fine-tunes relatively small LLMs, such as LLaMA-7B, on widely available datasets such as ARC-Easy/Challenge, OpenBookQA, SIQA, and GSM8K. Across iterations, the method weights preference data by its consistency, achieving stronger alignment without requiring large-scale human-labeled datasets.
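The sketch below illustrates this idea in code. It is our own hedged illustration, not the authors' implementation: the specific scheme of mapping each prompt's Kendall's tau to a [0, 1] weight on its DPO loss is an assumption, chosen to match the weighting-by-consistency behavior described above.

```python
import torch
from scipy.stats import kendalltau

def consistency_weight(prev_rewards, curr_rewards):
    """Kendall's tau between the reward scores the same candidate responses
    received in the previous and current iterations, mapped from [-1, 1] to [0, 1]."""
    tau, _ = kendalltau(prev_rewards, curr_rewards)
    return 0.5 * (tau + 1.0)

def consistency_regularized_loss(per_prompt_dpo_losses, prev_rewards, curr_rewards):
    """Down-weight preference pairs whose self-assigned rankings disagree across iterations."""
    weights = torch.tensor(
        [consistency_weight(p, c) for p, c in zip(prev_rewards, curr_rewards)],
        dtype=per_prompt_dpo_losses.dtype,
    )
    return (weights * per_prompt_dpo_losses).mean()

# Toy example: two prompts, four candidate responses each, scored by the model itself.
prev = [[0.9, 0.4, 0.2, 0.1], [0.3, 0.5, 0.8, 0.6]]
curr = [[0.8, 0.5, 0.3, 0.1], [0.7, 0.2, 0.1, 0.4]]  # second prompt's ranking flips
losses = torch.tensor([0.62, 0.95])
print(consistency_regularized_loss(losses, prev, curr).item())
```

With this kind of weighting, prompts whose self-assigned rankings flip between iterations contribute little to the update, which is how unreliable self-generated labels are kept from being amplified.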
CREAM outperforms baselines on many downstream tasks, both in alignment and in debiasing the self-rewarding process. Notable accuracy improvements include an increase from 86.78% to 89.52% on ARC-Easy and from 69.50% to 72.06% on SIQA. These consistent gains across iterations show the consistency regularization mechanism at work. While standard self-rewarding methods tend to suffer from lower reward consistency and weaker alignment, CREAM outperforms existing models, even when compared against systems that rely on high-quality external reward models. It maintains these gains without any external assistance, underscoring its robustness in generating reliable preference data. Moreover, the model keeps improving in both accuracy and reward-consistency metrics, reflecting the importance of regularization in mitigating reward bias and improving the efficiency of self-rewarding. These results establish CREAM as a robust, scalable, and efficient approach to aligning large language models.
In conclusion, CREAM offers a novel solution to the challenge of rewarding bias in self-rewarding language models by introducing a consistency regularization mechanism. By focusing on reliable and consistent preference data, CREAM delivers a substantial improvement in alignment performance, especially for fairly small models such as LLaMA-7B. By reducing reliance on human-annotated data, the method represents a significant step toward scalable and efficient preference learning, positioning it as a valuable contribution to the continued development of LLMs for real-world applications. The empirical results validate that CREAM outperforms existing methods and can meaningfully improve the alignment and reliability of LLMs.
Check out the Paper. All credit for this research goes to the researchers of this project.
Aswin AK is a consulting intern at MarkTechPost. He is pursuing his dual degree at the Indian Institute of Technology Kharagpur. He is passionate about data science and machine learning, and brings a strong academic background and practical experience solving real-life interdisciplinary challenges.