Reinforcement learning from human feedback (RLHF) is an effective approach to aligning language models with human preferences. Fundamental to RLHF is learning a reward function that scores responses according to human preferences. Two main approaches to learning such a reward are 1) training an explicit reward model, as in standard RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Previous work has shown that the DPO implicit reward can approximate a trained reward model, but it is unclear to what extent DPO's implicit reward generalizes under distribution shift, which may arise from limited preference data or from changes in the language model being trained. We address this question by comparing the accuracy of DPO implicit rewards and RLHF reward models in distinguishing preferred from rejected responses. Our findings indicate that the DPO implicit reward performs similarly to RLHF reward models on in-distribution data but significantly underperforms them out of distribution. Across five out-of-domain configurations, DPO shows a mean accuracy drop of 3% and a maximum drop of 7%, highlighting the shortcomings of DPO's implicit reward model for preference optimization. These findings indicate that the implicit reward model of DPO has limited generalizability and motivate the integration of an explicit reward model into iterative DPO approaches.
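For reference, the implicit reward discussed above is the standard form induced by the DPO objective (following Rafailov et al.'s formulation; the symbols $\beta$, $\pi_\theta$, and $\pi_{\mathrm{ref}}$ are the usual DPO notation, not notation defined in this abstract):
$$
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
$$
where $\pi_\theta$ is the DPO-trained policy, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ is the KL-regularization coefficient. Comparing this quantity for a preferred and a rejected response yields the preference accuracy measured in this work.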