Large language models (LLMs) are increasingly aligned through reinforcement learning from human feedback (RLHF), but current reward modeling approaches struggle to accurately capture human preferences. Traditional reward models, trained as simple classifiers, cannot reason explicitly about response quality, which limits their effectiveness in guiding LLM behavior. The core problem is their inability to generate reasoning traces: every evaluation must be performed implicitly within a single forward pass, which hampers the model's ability to assess the nuances of human preferences. Alternative approaches such as the LLM-as-a-Judge framework attempt to address this limitation, but they generally underperform classical reward models on pairwise preference classification tasks, highlighting the need for a more effective method.
Researchers have attempted a variety of approaches to address the challenges that reward modeling presents for language models. Ranking models such as Bradley-Terry and Plackett-Luce have been employed, but they struggle with intransitive preferences. Some studies directly model the probability of one response being preferred over another, while others model rewards against multiple objectives. Recent work has also proposed retaining and continuing to train the language modeling head as a form of regularization.
Critique-based feedback methods have also been explored, with some using self-generated critiques to improve generation quality or to serve as preference signals. However, these approaches differ from efforts to train better reward models when human preference data are available. Some researchers have investigated using oracle critiques or human-labeled critique preferences to teach language models to critique effectively.
The LLM-as-a-Judge framework, which uses a grading rubric to evaluate responses, shares similarities with critique-based methods but focuses on evaluation rather than revision. While this approach produces chain-of-thought reasoning, it generally underperforms classical reward models on pairwise preference ranking tasks.
Researchers from Databricks, MIT, and the University of California, San Diego present Critique-out-Loud (CLoud) reward models, a new approach to improving reward modeling for reinforcement learning from human feedback. These models generate a detailed critique of how well an assistant's response answers a user's query before producing a scalar reward for the quality of the response, combining the strengths of classical reward models and the LLM-as-a-Judge framework.
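To make the two-step setup concrete, here is a minimal sketch of what a critique-then-reward model could look like: a decoder-only LM with an added scalar reward head that first generates a critique and then scores the full sequence. The class name, prompt template, and helper function are illustrative assumptions, not the released implementation.

```python
# Sketch of a critique-then-reward model (illustrative, not the official code).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class CLoudStyleRewardModel(nn.Module):
    def __init__(self, base_name: str):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(base_name)
        # Scalar reward head applied to the final hidden state of the sequence.
        self.reward_head = nn.Linear(self.lm.config.hidden_size, 1)

    def score(self, input_ids: torch.Tensor) -> torch.Tensor:
        """Return a scalar reward per sequence (no padding handling, for brevity)."""
        out = self.lm(input_ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]  # hidden state at the last token
        return self.reward_head(last_hidden).squeeze(-1)

def critique_then_reward(model: CLoudStyleRewardModel, tokenizer, prompt: str, response: str) -> float:
    # 1) Generate a critique of the response, conditioned on the prompt.
    eval_text = f"User: {prompt}\nAssistant: {response}\nCritique:"
    ids = tokenizer(eval_text, return_tensors="pt").input_ids
    critique_ids = model.lm.generate(ids, max_new_tokens=256, do_sample=True)
    # 2) Score prompt + response + generated critique with the reward head.
    with torch.no_grad():
        return model.score(critique_ids).item()
```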
CLoud reward models are trained on a preference dataset containing prompts, responses, and oracle critiques. Training combines supervised fine-tuning on the oracle critiques for critique generation with a Bradley-Terry preference objective for producing scalar rewards. To further improve performance, the researchers explore multi-sample inference techniques, in particular self-consistency, which samples multiple critiques per response and marginalizes the predicted rewards across critiques for a more accurate reward estimate.
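The two training signals described above can be sketched as a combined loss: a Bradley-Terry term, -log σ(r_chosen − r_rejected), on the scalar rewards, plus a standard next-token loss on the oracle critique tokens. The batch layout and loss weighting below are assumptions for illustration, building on the sketch above, not the paper's exact recipe.

```python
import torch.nn.functional as F

def cloud_style_training_loss(model, batch, critique_weight: float = 1.0):
    """Illustrative combined loss for a critique-generating reward model.

    Assumes `batch` holds token ids for the chosen and rejected
    (prompt, response, critique) sequences, plus critique labels with
    non-critique positions masked to -100.
    """
    # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    r_chosen = model.score(batch["chosen_ids"])
    r_rejected = model.score(batch["rejected_ids"])
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Supervised fine-tuning loss on the oracle critique tokens.
    sft_loss = model.lm(batch["chosen_ids"], labels=batch["critique_labels"]).loss

    return bt_loss + critique_weight * sft_loss
```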
This approach aims to unify reward models and the LLM-as-a-Judge framework, which could lead to significant improvements in pairwise preference classification accuracy and Best-of-N success rates across multiple benchmarks. The researchers also investigate key design choices, such as on-policy versus off-policy training and the benefit of self-consistency over sampled critiques, for optimizing reward modeling performance.
CLoud reward models extend classical reward models by adding a language modeling head alongside the base model and reward head. Training proceeds in stages: supervised fine-tuning on oracle critiques, replacing the oracle critiques with self-generated critiques, and then training the reward head on those self-generated critiques, which minimizes the distribution shift between training and inference. The model is trained with modified loss functions, combining a Bradley-Terry preference loss with a critique supervised fine-tuning loss. To improve performance at inference time, CLoud models can employ self-consistency, sampling multiple critiques for a given prompt-response pair and averaging their predicted rewards for a final estimate.
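Self-consistency at inference time then amounts to sampling several critiques for the same prompt-response pair and averaging the resulting rewards; a short sketch, reusing the hypothetical critique_then_reward helper from above:

```python
def self_consistent_reward(model, tokenizer, prompt: str, response: str,
                           num_critiques: int = 8) -> float:
    """Average rewards over several sampled critiques to reduce the variance
    introduced by any single critique."""
    rewards = [critique_then_reward(model, tokenizer, prompt, response)
               for _ in range(num_critiques)]
    return sum(rewards) / len(rewards)
```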
The researchers evaluated CLoud reward models against classical reward models using two key metrics: pairwise preference classification accuracy and Best-of-N (BoN) success rate. For pairwise preference classification, they used the RewardBench evaluation set, which spans the Chat, Chat-Hard, Safety, and Reasoning categories. The BoN success rate was evaluated on ArenaHard, an open-ended generation benchmark.
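Both metrics are simple to compute given a reward function; the sketch below shows pairwise accuracy (does the reward model score the chosen response above the rejected one?) and Best-of-N selection (pick the highest-reward candidate), with the data format assumed rather than taken from the benchmarks' actual loaders.

```python
def pairwise_accuracy(reward_fn, pairs) -> float:
    """pairs: iterable of (prompt, chosen_response, rejected_response) triples."""
    results = [reward_fn(p, chosen) > reward_fn(p, rejected)
               for p, chosen, rejected in pairs]
    return sum(results) / max(len(results), 1)

def best_of_n(reward_fn, prompt: str, candidates: list[str]) -> str:
    """Select the candidate response with the highest predicted reward."""
    return max(candidates, key=lambda resp: reward_fn(prompt, resp))
```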
CLoud reward models significantly outperformed classical reward models at pairwise preference classification across all RewardBench categories, at both the 8B and 70B model scales, yielding a substantial increase in average accuracy.
In the BoN evaluation on ArenaHard, CLoud models demonstrated a Pareto improvement over classical models, achieving equal or significantly higher win rates. In the Best-of-16 setting, CLoud improved the win rate by 1.84 and 0.89 percentage points for the 8B and 70B models, respectively. These results suggest that CLoud reward models are better than classical reward models at guiding the behavior of language models.
This study presents CLoud reward models, which represent a significant advancement in preference modeling for language models. By preserving language modeling capabilities alongside a scalar reward head, these models explicitly reason about response quality through critique generation. The approach demonstrates substantial improvements over classical reward models in pairwise preference classification accuracy and Best-of-N decoding performance. Self-consistency decoding proved beneficial for reasoning tasks, particularly those with short reasoning horizons. By unifying language generation with preference modeling, CLoud reward models establish a new paradigm that opens avenues for improving reward models through variable inference-time compute, laying the foundation for more sophisticated and effective preference modeling in language model development.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Don't forget to join our ML SubReddit with over 49,000 members.
Find upcoming AI webinars here.
Asjad is a consulting intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching applications of Machine Learning in the healthcare domain.