When trained on massive datasets, large unsupervised language models acquire capabilities that surprise even their creators. These models, however, are trained on text produced by people with a wide range of goals, motivations, and abilities, and not all of those goals and abilities are worth imitating. Building reliable, effective, and controllable systems therefore requires selecting the desired responses and behavior from the model's very broad pool of knowledge and skills.
Researchers at Stanford University and CZ Biohub show how to optimize a language model to match human preferences without using an explicit reward model or reinforcement learning. Their work demonstrates that the RL-based objective employed by current approaches can be optimized exactly with a simple binary cross-entropy objective, considerably simplifying the preference-learning pipeline, and shows how to do this in practice.
They propose Direct Preference Optimization (DPO), a new algorithm that implicitly optimizes the same objective as existing RLHF algorithms (reward maximization under a KL-divergence constraint) but is simpler to implement and train. Intuitively, the DPO update increases the relative log probability of preferred over dispreferred responses, and it includes a dynamic, per-example importance weight that prevents the model from degenerating.
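For readers who want the objective in symbols, the KL-constrained reward maximization that RLHF methods target and the resulting DPO loss can be written as follows (notation as in the paper: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference policy, $y_w$ and $y_l$ the preferred and dispreferred responses, and $\beta$ the strength of the constraint):

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\big]$$

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$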
Like existing algorithms, DPO relies on a theoretical preference model that measures how well a reward function fits empirical preference data. But whereas conventional approaches use that preference model to define a loss for training a reward model and then train a policy to maximize the learned reward, DPO uses a change of variables to define the preference loss directly as a function of the policy. Given a dataset of human preferences over model responses, DPO can therefore optimize a policy with a simple binary cross-entropy objective, without explicitly learning a reward function or sampling from the policy during training.
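To make the objective concrete, here is a minimal PyTorch-style sketch of that binary cross-entropy loss. It is an illustration under stated assumptions, not the authors' implementation: the function name, argument names, and the choice of beta are all hypothetical, and the per-sequence log probabilities are assumed to be computed elsewhere for both the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO binary cross-entropy loss.

    Each argument is a tensor of per-sequence log probabilities
    (summed over tokens) of the preferred ("chosen") and dispreferred
    ("rejected") responses under the policy being trained and the
    frozen reference model. `beta` controls the strength of the
    implicit KL constraint; 0.1 is an illustrative value.
    """
    # Log-ratios of policy vs. reference model for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Binary cross-entropy on the margin between the two log-ratios.
    # The gradient of the logsigmoid term is what produces the
    # dynamic, per-example weighting mentioned above.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Example call with dummy values for a batch of 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Because the loss depends only on log probabilities of responses already in the preference dataset, no sampling from the policy and no separate reward network are needed during training.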
The work's findings demonstrate that DPO is as effective as state-of-the-art approaches such as PPO-based RLHF for preference-based learning across various tasks, including sentiment control, summarization, and dialogue, with language models of up to 6B parameters. In human evaluations, DPO summaries were preferred over PPO summaries 58% of the time and over the human-written summaries in the test set 61% of the time. On Anthropic HH dialogue, DPO's single-turn responses were preferred over the chosen completions in the dataset 60% of the time.
The team notes that DPO has many potential uses beyond training language models from human preferences; for example, it could be used to train generative models in other modalities.
Evaluations of the proposed method go up to 6B-parameter models, but the team believes further work should explore scaling DPO to state-of-the-art models trained on orders of magnitude more data. The researchers also found that the prompt affects the win rates computed by GPT-4. In the future, they plan to investigate the most effective ways of eliciting expert judgments from automated systems.
Check out the Paper. Don't forget to join our 22k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.