In recent years, large language models have attracted significant attention from researchers and laypeople alike due to their impressive capabilities. These models, like GPT-3, can generate human-like text, engage in conversations with users, perform tasks like summarizing text and answering questions, and even write code. In many scenarios, the quality of the generated text plays a key role in evaluating the language model: for a good user experience, the user expects the model to generate error-free executable code or to write a poem that shows some level of creativity. Designing a loss function that captures such attributes is difficult, so most previous research relies on loss functions based on next-token prediction or similar criteria. A growing line of research instead incorporates human feedback as the performance measure and uses that feedback as a signal to optimize the model. This idea is known as reinforcement learning from human feedback (RLHF), and several powerful models, such as ChatGPT, GPT-4, and Claude, currently employ this technique.
Adding another model to the list of successful RLHF applications, Hugging Face researchers have released StackLLaMA, a 7B-parameter language model based on Meta's LLaMA that has been trained to answer Stack Exchange questions using RLHF with Hugging Face's Transformer Reinforcement Learning (TRL) library. The researchers fine-tuned Meta's original LLaMA model using a combination of three main strategies: supervised fine-tuning (SFT), reward/preference modeling (RM), and reinforcement learning from human feedback (RLHF). The model can be accessed here, and the entire training pipeline is available as part of the TRL library.
The Hugging Face researchers noted that RLHF is only one fine-tuning step; deciding on the initial model is therefore a crucial preliminary choice. For this purpose, they chose Meta AI's newly introduced LLaMA models. This collection of foundation language models can outperform even GPT-3 and comes in sizes ranging from 7B to 65B parameters; the researchers went with the 7B model for their experiments. They also noted that a good dataset plays an important role in providing the right human feedback. On this front, they chose the StackExchange dataset, which includes more than 10 million question-answer pairs on a wide range of topics, including StackOverflow code snippets. Another attractive feature of this dataset is that each answer carries its number of upvotes and a label marking the accepted answer, which proved very useful for the reward model.
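To make the idea concrete, here is a minimal sketch of how upvote counts and the accepted-answer flag could be turned into (chosen, rejected) preference pairs for a reward model. The record field names (`question`, `answers`, `upvotes`, `is_accepted`) are hypothetical and not the actual schema of the released dataset.

```python
# Minimal sketch: build (chosen, rejected) preference pairs from Q&A records.
# Field names here are hypothetical, not the real StackExchange dump schema.

def build_preference_pairs(records):
    """For each question with >= 2 answers, pair the best-scored answer
    (by upvotes, with the accepted flag breaking ties) against a worse one."""
    pairs = []
    for rec in records:
        answers = sorted(
            rec["answers"],
            key=lambda a: (a["upvotes"], a["is_accepted"]),
            reverse=True,
        )
        if len(answers) < 2:
            continue  # need at least two answers to form a preference pair
        chosen, rejected = answers[0], answers[-1]
        pairs.append({
            "prompt": rec["question"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs

# Example usage with a toy record
toy = [{
    "question": "How do I reverse a list in Python?",
    "answers": [
        {"text": "Use list(reversed(xs)).", "upvotes": 42, "is_accepted": True},
        {"text": "Write a for loop.", "upvotes": 3, "is_accepted": False},
    ],
}]
print(build_preference_pairs(toy))
```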
The Hugging Face team first tuned the model for a specific domain (in their case, question answering) with a causal language modeling objective, before training the reward model and fine-tuning with reinforcement learning. To accomplish this, the team trained the language model on a subset of the StackExchange dataset using a technique known as packing: instead of padding short sequences or truncating long ones, many sequences are concatenated (separated by an end-of-sequence token) and the result is cut into chunks of the model's context length, so no compute is wasted on padding tokens. The model was then trained for roughly a thousand steps, concluding the supervised fine-tuning stage. The next step was to train the reward model. Since fine-tuning the model with RLHF directly from manual annotations is time-consuming and labor-intensive, the researchers trained a reward model to mimic how a human would evaluate the text. Possible strategies include predicting a numeric rating directly or a binary good/bad label; because the StackExchange dataset contains at least two answers for each question, the researchers instead selected a preferred answer per pair based on a scoring metric and trained the model to rank it above the alternative. They applied this methodology to a held-out subset of the dataset to evaluate the reward model. Its final accuracy of 67% is quite respectable, considering how difficult the task is even for human annotators.
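Reward models of this kind are commonly trained with a pairwise ranking objective: given a preferred and a less-preferred answer to the same question, the model's scalar score for the preferred one should be higher. The snippet below is a minimal PyTorch sketch of that loss, intended as an illustration rather than the released training script.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss commonly used for RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: scalar scores the reward model assigned to chosen vs. rejected answers.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])

loss = pairwise_reward_loss(chosen, rejected)
accuracy = (chosen > rejected).float().mean()  # fraction of pairs ranked correctly,
                                               # the metric behind figures like 67%
print(loss.item(), accuracy.item())
```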
With the fine-tuned language model and the reward model in hand, the final step was to run the RL loop. This procedure can be summarized in three main steps: generating answers from prompts, scoring the answers with the reward model, and running a reinforcement learning policy-optimization step against those scores. Previous work on training language models with RL has observed that the model can learn to exploit the reward model by generating complete gibberish that nonetheless receives high rewards. To counteract this, the researchers added a penalty to the reward that discourages the policy from drifting too far from the original model. Based on the experiments carried out by the team, it is safe to conclude that the resulting model delivers satisfactory results on a wide range of topics.
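This generate-score-optimize loop maps naturally onto the TRL API. The following is a minimal sketch assuming a TRL release from around the time of StackLLaMA (circa 0.4); argument names and the `PPOTrainer` interface have changed in newer versions. The model ID and the `compute_reward` function are placeholders, not the released pipeline.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "huggyllama/llama-7b"  # placeholder checkpoint for the sketch
config = PPOConfig(model_name=model_name, learning_rate=1.4e-5,
                   batch_size=1, mini_batch_size=1, init_kl_coef=0.2)

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

def compute_reward(question: str, answer: str) -> float:
    """Placeholder standing in for the trained reward model; returns a scalar score."""
    return 0.0

prompts = ["Question: How do I sort a dict by value in Python?\n\nAnswer: "]
generation_kwargs = {"do_sample": True, "top_k": 0, "top_p": 1.0,
                     "max_new_tokens": 128, "pad_token_id": tokenizer.eos_token_id}

# One RLHF iteration: generate -> score -> PPO step. The per-token KL penalty
# against ref_model is added to the reward inside PPOTrainer.
query_tensors, response_tensors, rewards = [], [], []
for prompt in prompts:
    query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze()
    response = ppo_trainer.generate(query, **generation_kwargs).squeeze()[len(query):]
    query_tensors.append(query)
    response_tensors.append(response)
    rewards.append(torch.tensor(compute_reward(prompt, tokenizer.decode(response))))

stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```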
In a nutshell, the work of the Hugging Face researchers can be summarized as creating a human-annotated dataset, adapting the language model to the domain, training a reward model, and finally training the model with RL. Although StackLLaMA is an important stepping stone in the RLHF world, the model is far from perfect. There are several ongoing issues that the Hugging Face team is working to resolve, such as occasional spikes in the loss, which lead to model instability. Currently, the model has been made public for education and research on RLHF and the TRL library. The team has also explicitly stated that feedback entered into the app is collected to further refine the model, so users should refrain from sharing sensitive personal information in the app.
Check out the Demo, Code, and Blog. All credit for this research goes to the researchers of this project. Also, don't forget to join our 18k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in various challenges.