The well-known artificial intelligence (AI) chatbot ChatGPT, built on the GPT transformer architecture, is trained with Reinforcement Learning from Human Feedback (RLHF). RLHF has become an increasingly important method for harnessing pre-trained large language models (LLMs) so that they generate more helpful and truthful responses aligned with human preferences.
In RLHF, a reward model is first trained on human preferences over responses to given prompts; the language model is then trained with reinforcement learning to produce responses that maximize this learned reward. Since collecting human preference ratings is typically easier than collecting demonstrations for supervised fine-tuning, this approach streamlines data collection.
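As a quick illustration of the first step, reward models in RLHF are commonly fit to pairwise human preferences with a Bradley-Terry-style loss. The sketch below is a minimal, generic version of that objective, not the paper's specific implementation; the function and variable names are assumptions.

```python
# Minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train an
# RLHF reward model from human preference data. Shapes and names are
# illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen / reward_rejected: scalar rewards for the preferred and
    dispreferred response to the same prompt, each of shape (batch,)."""
    # Maximize the log-probability that the preferred response scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```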
However, reward hacking is a subtle problem with RLHF, in which the policy obtains a high reward without meeting the actual objectives. This happens because of the reward model's limited out-of-distribution (OOD) generalization and possible imperfections in how it represents human preferences. Being a powerful LLM, the language model can produce OOD examples that exploit flaws in the reward model.
The situation is further complicated by human preference data, which are often biased and inconsistent due to the complexity and subjectivity of the tasks, flaws in rating guidelines, and the limited quality of raters. Verbosity is a common example of reward hacking: models produce more tokens to make responses appear more thorough or better formatted, with no real improvement in quality.
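One simple way to see whether a reward model is rewarding verbosity is to measure how strongly its scores correlate with response length. The snippet below is a hedged diagnostic sketch with illustrative names; it is not part of the paper's method.

```python
# Hedged diagnostic sketch: a high correlation between reward-model scores and
# response lengths (in tokens) suggests the reward is tracking verbosity
# rather than quality. Names are illustrative.
import numpy as np

def reward_length_correlation(rewards, lengths):
    """Pearson correlation between reward scores and response lengths."""
    return float(np.corrcoef(np.asarray(rewards, dtype=float),
                             np.asarray(lengths, dtype=float))[0, 1])
```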
To address these issues, recent research from NVIDIA and the University of Maryland aims to mitigate reward hacking by examining how RL algorithms and reward models affect verbosity and performance. The team presents an evaluation protocol that compares various training configurations and accounts for biases in model-based evaluations. By measuring performance on the Pareto front of evaluation score versus response length, the protocol gives a comprehensive picture of model behavior across different response lengths.
The procedure analyzes the trade-off between LLM evaluation score and response length, allowing a systematic comparison of different training setups. By varying the training hyperparameters, one can assess how these changes shift the balance between verbosity and response quality.
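To make the evaluation concrete, each training run can be summarized by its mean response length and mean evaluation score, and the Pareto front then consists of runs that no other run beats on both axes. The sketch below is a generic implementation of that idea under assumed data structures, not the paper's evaluation code.

```python
# Hedged sketch of the length-vs-score Pareto front used to compare training
# runs: each run is a (mean_length, mean_score) pair, and a run is on the
# front if no other run is both at least as short and at least as good,
# excluding identical points. The data layout is an assumption.
from typing import List, Tuple

def pareto_front(runs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    front = []
    for length, score in runs:
        dominated = any(
            o_len <= length and o_score >= score and (o_len, o_score) != (length, score)
            for o_len, o_score in runs
        )
        if not dominated:
            front.append((length, score))
    return sorted(front)  # ordered by increasing mean length
```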
The study analyzes RL hyperparameters and techniques such as reward clipping and length penalty to reduce length-based reward hacking. The primary goal is to remove the spurious length signal from the reward, although various tuning procedures can also perform better. To achieve this, the team proposes a two-headed reward model that disentangles length-related representations from true preferences; the length head is discarded during RL.
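The sketch below illustrates the two-headed idea in PyTorch: one head captures the length-correlated component of the reward and the other the length-independent quality component, and only the quality head is kept at RL time. Class names, the pooling choice, and the HuggingFace-style backbone interface are assumptions; ODIN's actual training losses are only indicated in comments.

```python
# A minimal two-headed reward model sketch in the spirit of ODIN.
# Assumes a HuggingFace-style backbone whose output exposes last_hidden_state.
import torch
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                       # pre-trained LLM encoder
        self.quality_head = nn.Linear(hidden_size, 1)  # length-independent reward
        self.length_head = nn.Linear(hidden_size, 1)   # length-correlated reward

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, -1]                         # final-token representation (assumed pooling)
        r_quality = self.quality_head(pooled).squeeze(-1)
        r_length = self.length_head(pooled).squeeze(-1)
        # During reward-model training, the combined score r_quality + r_length is
        # fit to human preferences, while auxiliary objectives push the length
        # correlation into r_length only. During RL, r_length is discarded and
        # r_quality alone scores the policy's responses.
        return r_quality, r_length
```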
With the proposed reward-disentangling technique, ODIN, the policy achieved a larger Pareto front than previous results, even with a more expensive tuning budget. Both Proximal Policy Optimization (PPO) and ReMax benefit from ODIN, indicating that it can be applied to other RL tuning methods to reduce length hacking.
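At RL time the usage is simple: only the quality head's score is passed to whichever RL algorithm is in use. The snippet below reuses the hypothetical TwoHeadRewardModel from the sketch above and is a hedged illustration, not the paper's training loop.

```python
# Hedged usage sketch: during PPO or ReMax fine-tuning, only the quality head
# of the two-headed reward model scores the policy's responses.
import torch

def score_for_rl(reward_model, input_ids, attention_mask):
    with torch.no_grad():
        r_quality, _r_length = reward_model(input_ids, attention_mask)
    return r_quality  # the length head is discarded during RL
```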
In conclusion, the experimental results show a notable decrease in the reward's association with response length. The resulting policy performs significantly better when information quality is prioritized over verbosity. The method thus reduces length-related reward hacking and improves the reliability and usefulness of LLMs trained with the RLHF paradigm.
Check out the paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills and a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.