There are several methods to align LLMs with human preferences. Beyond reinforcement learning from human feedback (RLHF), which is often considered too resource-intensive to apply consistently to newly fine-tuned models, direct preference optimization (DPO) is one of the most popular alternatives for LLM alignment.
Although DPO is significantly more cost-effective than RLHF, it still requires a reference model in addition to the “policy” model (i.e., the model that is actively being trained). This means that both models must be loaded into GPU memory simultaneously, which can be challenging for single-GPU setups, especially with large models.
A more memory-efficient approach is to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This method becomes even more efficient if both the policy and reference models share the same base model; in that case, we load the base model once, then load a frozen adapter for the reference model and a trainable adapter for the policy model, which significantly reduces memory requirements.
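To make this concrete, here is a minimal sketch of what such a setup can look like with TRL and PEFT. The base model, dataset, and hyperparameters are illustrative placeholders, and some argument names differ across TRL versions; when `ref_model` is left as `None` and a `peft_config` is provided, TRL obtains the reference log-probabilities by disabling the adapter on the same base model, so the base weights only need to be loaded once.

```python
# Minimal sketch of DPO training with a LoRA adapter (placeholder names/values).
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", and "rejected" columns (placeholder).
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Trainable LoRA adapter: only these low-rank matrices receive gradients.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="dpo-lora-sketch",
    beta=0.1,                       # DPO temperature
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    max_length=1024,
    max_prompt_length=512,
)

# ref_model=None + peft_config: the reference model is the same base model
# with the adapter turned off, so only one copy of the base weights is kept.
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # older TRL versions use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```

The key design point is that the reference model is never instantiated as a second full copy: the frozen "reference" behavior comes from running the shared base model with the adapter disabled.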
However, in my opinion, the effect of LoRA on DPO performance is still understudied. While LoRA can closely approximate full training, its performance…