Iterative preference optimization to improve reasoning tasks in language models
Iterative preference optimization methods have proven effective for general instruction-tuning tasks, but yield only limited improvements on reasoning tasks. These ...