Aligning large pre-trained models with human preferences has become increasingly important as these models have grown more capable, and it is especially challenging because large training datasets inevitably contain undesirable behaviors. Reinforcement learning from human feedback (RLHF) has become the dominant approach to this problem: RLHF methods use human preferences to distinguish between desirable and undesirable behaviors and then improve a learned policy accordingly. This approach has shown encouraging results when used to fine-tune robot policies, improve image generation models, and fine-tune large language models (LLMs) on non-ideal data. Most RLHF algorithms proceed in two phases.
First, human preference data is collected and used to train a reward model. Second, an off-the-shelf reinforcement learning (RL) algorithm optimizes that reward model. Unfortunately, this two-phase paradigm rests on a flawed assumption. For algorithms to learn reward models from preference data, human preferences are assumed to be distributed according to the discounted sum of rewards, or partial return, of each behavior segment. However, recent research questions this assumption and instead suggests that human preferences follow the regret of each behavior under the optimal policy for the expert's reward function. Human evaluators seem to intuitively judge how close an action is to optimal, rather than whether particular states and actions yield higher rewards.
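For reference, the partial-return assumption is usually formalized with a Bradley-Terry-style model over segment returns; in notation commonly used in the RLHF literature (assumed here, not quoted from the paper), the probability that segment $\sigma^{+}$ is preferred to segment $\sigma^{-}$ is

$$
P\left[\sigma^{+} \succ \sigma^{-}\right] \;=\; \frac{\exp \sum_{t} r\left(s_{t}^{+}, a_{t}^{+}\right)}{\exp \sum_{t} r\left(s_{t}^{+}, a_{t}^{+}\right) + \exp \sum_{t} r\left(s_{t}^{-}, a_{t}^{-}\right)},
$$

where $r$ is the reward model being fit and $(s_t, a_t)$ are the states and actions in each segment. The regret-based view keeps this form but replaces the summed rewards with the summed optimal advantages $A^{*}(s_t, a_t)$, i.e., a measure of how close each action is to optimal.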
The optimal advantage function, or negated regret, is therefore arguably the ideal quantity to learn from feedback, rather than the reward. Two-phase RLHF algorithms use RL in their second phase to optimize the reward function learned in the first phase. In real-world applications, temporal credit assignment presents a range of optimization difficulties for RL algorithms, including the instability of approximate dynamic programming and the high variance of policy gradients. As a result, previous works restrict their scope to avoid these problems. For example, RLHF approaches for LLMs assume a contextual bandit formulation, in which the policy receives a single reward value for its response to a user prompt.
While this sidesteps long-horizon credit assignment and the resulting high-variance policy gradients, the one-step bandit assumption breaks down in practice because user interactions with LLMs are multi-step and sequential. Another example is the application of RLHF to low-dimensional, state-based robotics problems, where approximate dynamic programming works well, but these methods have yet to be extended to the more realistic setting of higher-dimensional continuous control with image observations. In general, RLHF approaches ease RL optimization by making restrictive assumptions about the sequential nature or dimensionality of the problem, and they rest on the mistaken premise that the reward function alone determines human preferences.
In contrast to the widely used partial return model, which scores segments by their summed rewards, researchers from Stanford University, UMass Amherst, and UT Austin introduce a new family of RLHF algorithms that employs a regret-based preference model. Unlike the partial return model, the regret-based model provides accurate information about the optimal course of action. Fortunately, this eliminates the need for RL, allowing RLHF problems to be addressed in the general MDP framework with high-dimensional state and action spaces. Their key insight is that combining the regret-based preference framework with the maximum entropy (MaxEnt) principle yields a bijection between advantage functions and policies.
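Concretely, in maximum-entropy RL the optimal advantage and the optimal policy are linked by the standard identity (written here with temperature $\alpha$; the paper's exact formulation may differ):

$$
A^{*}(s, a) \;=\; \alpha \log \pi^{*}(a \mid s),
$$

so the cumulative optimal advantage of a segment can be rewritten as a scaled sum of log-probabilities under the optimal policy, turning a preference model over advantages into one over policies.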
By exchanging optimization over advantages for optimization over policies, they can derive a purely supervised learning objective whose optimum is the optimal policy under the expert's reward. Because the resulting objective resembles widely used contrastive learning objectives, they call their method Contrastive Preference Learning (CPL). CPL offers three main benefits over previous approaches. First, because CPL matches optimal advantages using only supervised objectives (rather than dynamic programming or policy gradients), it can scale as well as supervised learning. Second, CPL is entirely off-policy, making it possible to use any suboptimal offline data source. Third, CPL can be applied to preference queries over sequential data, allowing it to learn in arbitrary Markov decision processes (MDPs).
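As an illustration only, a minimal PyTorch-style sketch of such a contrastive preference loss is shown below; the function name `cpl_loss`, the tensor layout, and the discount and temperature handling are assumptions made for this sketch, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_chosen: torch.Tensor,
             logp_rejected: torch.Tensor,
             alpha: float = 0.1,
             gamma: float = 1.0) -> torch.Tensor:
    """Contrastive preference loss over a pair of behavior segments.

    logp_chosen, logp_rejected: (batch, T) tensors of per-step
    log pi(a_t | s_t) for the preferred and rejected segments.
    alpha is a temperature; gamma is an optional discount factor.
    """
    T = logp_chosen.shape[1]
    discounts = gamma ** torch.arange(
        T, dtype=logp_chosen.dtype, device=logp_chosen.device)
    # Under the MaxEnt bijection, the discounted sum of policy
    # log-probabilities plays the role of the segment's cumulative
    # optimal advantage (its negated regret).
    score_chosen = alpha * (discounts * logp_chosen).sum(dim=-1)
    score_rejected = alpha * (discounts * logp_rejected).sum(dim=-1)
    # Bradley-Terry-style comparison of the two segment scores:
    # -log sigmoid(score_chosen - score_rejected).
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Because the loss only needs log-probabilities of actions that are already in the dataset, it can be minimized with ordinary supervised training on off-policy preference data.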
To the best of their knowledge, no previous RLHF technique satisfies all three of these requirements simultaneously. They demonstrate that CPL meets them on sequential decision-making problems with high-dimensional, suboptimal, off-policy data. Notably, they show that CPL can efficiently learn temporally extended manipulation policies on the MetaWorld benchmark using the same RLHF-style fine-tuning procedure as dialogue models: policies are pre-trained with supervised learning from high-dimensional image observations and then fine-tuned with preferences. CPL matches the performance of prior RL-based techniques without dynamic programming or policy gradients, while being four times more parameter-efficient and 1.6 times faster. With denser preference data, CPL outperforms RL baselines on five of six tasks. In short, by leveraging the maximum entropy principle, the researchers arrive at Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, thereby avoiding the need for reinforcement learning.
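A rough sketch of that two-stage recipe, assuming a policy object that exposes a `log_prob(obs, action)` method (all names here are illustrative, not the authors' code), might look like this:

```python
import torch.nn.functional as F

def pretrain_bc(policy, demo_loader, optimizer):
    """Stage 1: behavior-cloning pre-training on (observation, action) pairs."""
    for obs, act in demo_loader:
        loss = -policy.log_prob(obs, act).mean()  # maximize likelihood of demo actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def finetune_with_preferences(policy, pref_loader, optimizer, alpha=0.1):
    """Stage 2: fine-tune the same policy on segment-level preferences."""
    for (obs_pos, act_pos), (obs_neg, act_neg) in pref_loader:
        logp_pos = policy.log_prob(obs_pos, act_pos)  # (batch, T) per-step log-probs
        logp_neg = policy.log_prob(obs_neg, act_neg)
        score_pos = alpha * logp_pos.sum(dim=-1)
        score_neg = alpha * logp_neg.sum(dim=-1)
        loss = -F.logsigmoid(score_pos - score_neg).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The same policy network is used in both stages; only the objective changes, which is what lets the preference fine-tuning stage remain a purely supervised procedure.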
Check out the paper. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.