HyPO: A hybrid reinforcement learning algorithm that uses offline data for contrast-based preference optimization and unlabeled online data for KL regularization
A fundamental aspect of AI research is tuning large language models (LLMs) so that their outputs align with human preferences. This ...