GPT-4 has been released and is already making headlines. It is the latest model in the family behind OpenAI's popular ChatGPT, which can generate text and answer questions in a human-like way. Following the success of GPT-3.5, GPT-4 is the latest milestone in scaling up deep learning and generative AI. Unlike the previous version, GPT-3.5, which only allowed ChatGPT to take text input, GPT-4 is multimodal in nature: it accepts both text and images as input. GPT-4 is a transformer model that has been pre-trained to predict the next token. It has been fine-tuned using reinforcement learning from human and AI feedback, and it was trained on public data as well as data licensed from third-party providers.
Here are some key points, drawn from Joris Baan's tweet thread, about how models like ChatGPT/GPT-4 differ from traditional language models.
The main reason the latest GPT models differ from traditional ones is the use of reinforcement learning from human feedback (RLHF). This technique is used in training language models such as GPT-4, unlike traditional language models, which are trained on a large corpus of text with the goal of predicting the next word in a sentence, i.e., the most probable sequence of words given a prompt. In contrast, RLHF trains the language model using feedback from human evaluators, which serves as a reward signal scoring the quality of the generated text, much like evaluation metrics such as BERTScore and BARTScore; the language model is then repeatedly updated to improve its reward score.
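To make the contrast concrete, here is a minimal, self-contained PyTorch sketch of the two training signals: the classic next-token cross-entropy loss on one hand, and a reward-weighted policy update on the other. The tiny model, tensor sizes, and random rewards are all illustrative stand-ins, not anything from GPT-4 itself.

```python
# Toy contrast between the two training signals (hypothetical tiny model, not GPT-4).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 100, 32


class TinyLM(nn.Module):
    """A minimal next-token predictor standing in for a large transformer."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the vocabulary at every position


lm = TinyLM()
tokens = torch.randint(0, vocab_size, (4, 16))  # a fake batch of token ids

# 1) Traditional objective: predict token t+1 from tokens up to t (cross-entropy).
logits = lm(tokens[:, :-1])
lm_loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# 2) RLHF-style objective (sketch): treat the batch as generated continuations,
# score them with a reward signal, and scale their log-likelihood by that reward
# (REINFORCE-style; full RLHF uses PPO and a KL penalty, covered later in the article).
with torch.no_grad():
    reward = torch.randn(4)  # stand-in for a learned reward model's scores
log_probs = F.log_softmax(logits, dim=-1)
chosen = log_probs.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=1)
rl_loss = -(reward * chosen).mean()
print(lm_loss.item(), rl_loss.item())
```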
A reward model is basically a language model that has been pre-trained on a large amount of text; it is similar to the base language model used to produce the text. Joris gives the example of DeepMind's Sparrow, a model trained using RLHF that relies on three pre-trained Chinchilla 70B models: one is used as the base language model for text generation, while the other two are used as separate reward models for the evaluation process.
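As a rough illustration of that structure, the toy sketch below (hypothetical, and vastly smaller than a Chinchilla 70B model) shows how a reward model typically reuses a language-model backbone but replaces the vocabulary head with a single scalar "quality" score.

```python
# Toy reward model: an LM-style backbone whose output head produces one scalar
# score per response instead of next-token logits.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32


class TinyRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.score_head = nn.Linear(hidden, 1)  # scalar score instead of vocab logits

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.score_head(h[:, -1]).squeeze(-1)  # read the score at the final position


reward_model = TinyRewardModel()
candidates = torch.randint(0, vocab_size, (2, 16))  # two candidate responses to a prompt
print(reward_model(candidates))  # one scalar reward per response
```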
In RLHF, data is collected by asking human annotators to choose the best generated text for a given prompt; these choices are then converted into a scalar preference value, which is used to train the reward model. The reward function combines the scores of one or more reward models with a policy-shift constraint designed to minimize the KL divergence between the output distributions of the original policy and the current policy, thus preventing the model from drifting too far from its original behavior. The policy is simply the language model that produces the text, and it is continually optimized to generate high-quality output. Proximal Policy Optimization (PPO), a reinforcement learning (RL) algorithm, is used to update the current policy's parameters in RLHF.
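The sketch below shows, with made-up tensors standing in for real model outputs, the two pieces described above: a pairwise ranking loss that turns annotator preferences into a trained reward model, and a KL-penalized reward that PPO then maximizes. The variable names and the penalty coefficient are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

# (a) Training the reward model from human choices: for each prompt, the
# annotator-preferred response should score higher than the rejected one
# (a pairwise Bradley-Terry-style ranking loss).
score_chosen = torch.randn(8, requires_grad=True)    # reward-model scores for preferred responses
score_rejected = torch.randn(8, requires_grad=True)  # reward-model scores for rejected responses
reward_model_loss = -F.logsigmoid(score_chosen - score_rejected).mean()

# (b) The reward PPO optimizes: the reward model's score minus a KL-style penalty
# that keeps the current policy close to the original (reference) policy.
beta = 0.1                             # strength of the KL constraint (illustrative value)
reward_model_score = torch.randn(8)    # r(x, y) from the reward model(s)
logp_current = torch.randn(8)          # log pi_current(y | x)
logp_reference = torch.randn(8)        # log pi_original(y | x)
ppo_reward = reward_model_score - beta * (logp_current - logp_reference)
print(reward_model_loss.item(), ppo_reward)
```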
Joris Baan also mentions the potential biases and limitations that can arise when collecting human feedback to train the reward model. He points out, using InstructGPT (the language model fine-tuned to follow human instructions) as an example, that human preferences are not universal and can vary depending on the target community. This implies that the data used to train the reward model can shape the model's behavior and produce undesired results.
The thread also mentions that decoding algorithms seem to play a minor role in the training process: ancestral sampling, often with temperature scaling, is the default method. This could indicate that the RLHF algorithm already steers the generator toward specific decoding strategies during training.
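For reference, ancestral sampling with temperature scaling is straightforward to sketch: divide the model's logits by a temperature, apply a softmax, and sample the next token from the resulting distribution. The logits below are random stand-ins for a real model's output.

```python
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 0.7) -> int:
    """Ancestral sampling of one token: softmax over temperature-scaled logits, then sample."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


logits = torch.randn(100)          # fake logits over a 100-token vocabulary
print(sample_next_token(logits))   # lower temperature -> sharper, less random choices
```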
In conclusion, the use of human preferences to train the reward model and guide the text generation process is a key difference between reinforcement learning-based language models, such as ChatGPT/GPT-4, and traditional language models. It allows the model to generate text that is more likely to be rated highly by humans, leading to better, more natural-sounding language.
This article is based on Joris Baan's tweet thread. All credit for this research goes to the researchers of this project. Also, don't forget to join our 16k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.