Stanford and UT Austin researchers propose contrastive preference learning (CPL): a simple RL-free method for RLHF that works with arbitrary MDPs and off-policy data
The challenge of aligning large pre-trained models with human preferences has gained importance as these models have ...