Incorporating human input is a key component of the recent impressive advances in Large Language Model (LLM) capabilities, such as ChatGPT and GPT-4. To use human feedback effectively, a reward model must first be trained to capture human preferences, values, and ethical considerations. The LLM is then fine-tuned with reinforcement learning under the guidance of this reward model. This procedure, known as reinforcement learning from human feedback (RLHF), aligns LLMs with human intent and markedly improves the quality of their interactions with people.
Building a reward model that works well and faithfully reflects human preferences is not easy. It becomes especially challenging when a human labeler cannot assign a numerical score to a response, or completion, for a given prompt. Pairwise comparisons of completions by quality, by contrast, are much easier for people to make, and this approach was used in building InstructGPT. Concretely, a human labeler is shown several LLM-generated completions for the same prompt and ranks them from highest to lowest perceived quality.
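To make this data format concrete, here is a minimal sketch (hypothetical function and variable names, not code from the paper) of how a labeler's ranking of several completions for one prompt is typically expanded into pairwise comparisons for training:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_completions):
    """Expand a human ranking (best first) into (prompt, preferred, rejected) pairs.

    A ranking of K completions yields K * (K - 1) / 2 pairwise comparisons,
    which is how ranking data is commonly turned into training examples.
    """
    pairs = []
    for better_idx, worse_idx in combinations(range(len(ranked_completions)), 2):
        pairs.append((prompt,
                      ranked_completions[better_idx],
                      ranked_completions[worse_idx]))
    return pairs

# Example: a labeler ranked three completions for the same prompt.
pairs = ranking_to_pairs(
    "Is the chicken a dinosaur?",
    ["Birds, including chickens, are living descendants of dinosaurs ...",  # ranked best
     "Chickens are related to dinosaurs in some way ...",
     "No, chickens are mammals ..."],                                       # ranked worst
)
print(len(pairs))  # 3 pairwise comparisons from a ranking of 3 completions
```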
A reward model, implemented as a neural network, is then trained to match these human preference rankings as closely as possible, and responses are scored according to it. Despite certain advantages, such as sidestepping calibration issues, rankings do not adequately capture how the reward distribution varies across prompts, that is, how much better a higher-ranked completion is than a lower-ranked one. This concern is particularly relevant because some RLHF prompts are open-ended, or in other words depend on the user's history, so the rewards of their completions can span a wide range.
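A minimal sketch of the kind of pairwise objective described here, assuming a toy PyTorch reward network over placeholder feature vectors; the log-sigmoid loss on reward differences is the standard choice for fitting rankings (as in InstructGPT), but the architecture and names below are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Placeholder reward model: maps a fixed-size feature vector to a scalar reward.
    In practice the backbone would be a pretrained LLM with a scalar head."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def pairwise_ranking_loss(reward_model, preferred, rejected):
    """-log sigmoid(r_preferred - r_rejected): fits the model to human rankings."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()

# Toy training step on random "completion features" standing in for LLM encodings.
model = ToyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = pairwise_ranking_loss(model, preferred, rejected)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```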
Conversely, other prompts are closed-ended and should elicit responses that score either high or low, so their reward distribution is roughly a two-point mass. Examples of closed-ended prompts include “Prove the Pythagorean theorem” and “Is the chicken a dinosaur?”, while an example of an open-ended prompt is “Write a short story about what AI will look like in 100 years.” The reward model can only help LLMs properly gauge uncertainty if it accounts for these prompt-specific subtleties.
Researchers from Stanford University, Princeton University, and the University of Pennsylvania document a surprising phenomenon: training a reward model on preference rankings can yield the same reward distribution regardless of the prompt. This phenomenon, which occurs in the late stage of training, is termed reward collapse. Interestingly, their theoretical analysis predicted it before it was confirmed empirically. They show that the collapsed reward distribution can be numerically inferred from a simple optimization program or, even more simply, from a closed-form expression. Their prediction of reward collapse agrees very well with the empirical findings.
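The sketch below illustrates the idea of inferring the collapsed distribution numerically; it assumes a log-sigmoid utility and rewards constrained to [0, 1], so it follows the paper's setup only in spirit, and the exact utility and optimization program are given in the paper:

```python
import torch
import torch.nn.functional as F

def collapsed_reward_distribution(n=8, steps=2000, lr=0.05):
    """Numerically infer the reward values that maximize the summed utility
    U(r_i - r_j) over all ranked pairs i < j, with rewards squashed into [0, 1].

    Because this objective no longer depends on the prompt, its maximizer is
    the single distribution that every prompt's rewards drift toward late in
    training. Here U is taken to be log-sigmoid (an assumption for illustration).
    """
    theta = torch.zeros(n, requires_grad=True)          # unconstrained parameters
    opt = torch.optim.Adam([theta], lr=lr)
    idx_i, idx_j = torch.triu_indices(n, n, offset=1)   # all pairs i < j (i ranked higher)
    for _ in range(steps):
        r = torch.sigmoid(theta)                         # rewards in [0, 1]
        utility = F.logsigmoid(r[idx_i] - r[idx_j]).sum()
        loss = -utility                                  # gradient ascent on the utility
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(theta).detach().sort(descending=True).values

print(collapsed_reward_distribution())  # the prompt-independent limiting distribution
```

Since the objective above depends only on the ranking structure and the chosen utility, not on the prompt itself, every prompt's empirical reward distribution converges toward the same maximizer, which is precisely the collapse the authors describe.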
Their second major contribution is a principled strategy for avoiding reward collapse, built on the same optimization program that predicted its occurrence. Reward collapse is undesirable because it erases fine distinctions between prompts and can lead to miscalibration of human preferences when LLMs are trained with reinforcement learning against the reward model. Terminating reward-model training early is a simple remedy, but it is quite arbitrary, and it can be difficult to decide when to stop.
In essence, they propose training the reward model with utility functions that vary with the prompt, so that the resulting reward distribution can be either widely spread or tightly concentrated, depending on whether the prompt is open-ended or closed-ended. This prompt-aware approach has the clear benefit of being analytical, allowing the structure of the reward distribution to be fully customized as needed. Their findings demonstrate that reward collapse can be substantially mitigated with this prompt-aware technique.
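A rough sketch of the prompt-aware idea, with two illustrative utilities chosen for this example rather than taken from the paper: a log-sigmoid utility for open-ended prompts, whose diminishing returns spread the rewards out, and a linear utility for closed-ended prompts, which (with rewards constrained to [0, 1]) pushes them toward the extremes:

```python
import torch
import torch.nn.functional as F

def utility(diff, open_ended):
    """Illustrative prompt-aware utilities (the exact family used in the paper may differ):
    - open-ended prompts: log-sigmoid, whose diminishing returns spread rewards out;
    - closed-ended prompts: linear, which pushes rewards toward 0 or 1
      (this assumes rewards are constrained to [0, 1], as in the collapse analysis)."""
    return torch.where(open_ended, F.logsigmoid(diff), diff)

def prompt_aware_loss(r_preferred, r_rejected, open_ended):
    """Negative summed utility over ranked pairs, with the utility chosen per prompt type."""
    return -utility(r_preferred - r_rejected, open_ended).mean()

# Toy batch: scalar rewards for (preferred, rejected) pairs and each pair's prompt type.
r_pref = torch.tensor([0.9, 0.8, 0.7], requires_grad=True)
r_rej = torch.tensor([0.2, 0.6, 0.1])
open_ended = torch.tensor([True, False, True])
loss = prompt_aware_loss(r_pref, r_rej, open_ended)
loss.backward()
print(float(loss), r_pref.grad)
```

Swapping the utility per prompt changes which distribution the training objective drives the rewards toward, which is how a prompt-aware loss keeps open-ended and closed-ended prompts from collapsing onto a single reward distribution.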
Check out the Paper and the GitHub link.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.