On the limited generalization capacity of the implicit reward model induced by direct preference optimization
Reinforcement learning from human feedback (RLHF) is an effective approach for aligning language models with human preferences. Fundamental to RLHF ...
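As background for the title's central object (a hedged sketch, not part of the original abstract): under the standard DPO derivation, the policy \pi_\theta and the frozen reference model \pi_{\mathrm{ref}} induce an implicit reward, where \beta is the KL-regularization coefficient and Z(x) is a prompt-dependent partition function; none of these symbols are defined in this excerpt:

r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).

The \beta \log Z(x) term cancels when two responses to the same prompt are compared, which is why DPO can optimize the Bradley-Terry preference likelihood without ever estimating Z(x).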