Large language models (LLMs) have demonstrated impressive proficiency across many tasks, but their ability to perform multi-step reasoning remains a major challenge. This limitation is especially evident in complex scenarios such as solving mathematical problems, controlling embodied agents, and web navigation. Traditional reinforcement learning (RL) methods, such as proximal policy optimization (PPO), have been applied to this problem, but they often entail high computational and data costs, making them less practical. Methods such as direct preference optimization (DPO), while effective at aligning models with human preferences, struggle with multi-step reasoning tasks: DPO's reliance on pairwise preference data and its uniform treatment of tokens undermine its ability to assign credit in sparse-reward settings. These obstacles highlight the need for more targeted and efficient approaches to improving LLM reasoning capabilities.
Introducing OREO: Offline Reasoning Optimization
OREO (Offline REasoning Optimization) is an offline RL approach designed specifically to address the shortcomings of existing methods for improving multi-step reasoning in LLMs. Developed collaboratively by researchers at UC San Diego, Tsinghua University, Salesforce Research, and Northwestern University, OREO builds on insights from maximum-entropy reinforcement learning. It jointly trains a policy model and a value function by optimizing the soft Bellman equation. This eliminates the reliance on pairwise preference data and allows the use of unpaired datasets with sparse rewards. It also enables precise credit assignment along reasoning trajectories, which is especially beneficial when success hinges on a few critical steps. The framework can further be extended to iterative exploration settings, and the learned value function can be used to improve inference through tree search at test time.
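For context, the soft Bellman equation referenced above is the standard consistency condition from maximum-entropy RL, written here in its undiscounted form with temperature β; OREO's exact step-level formulation for LLM reasoning (including any regularization toward a reference policy) is specified in the paper and may differ in details:

```latex
% Standard soft Bellman consistency from maximum-entropy RL (illustrative form):
Q^{*}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}}\!\left[ V^{*}(s_{t+1}) \right],
\qquad
V^{*}(s_t) = \beta \log \sum_{a} \exp\!\left( \tfrac{Q^{*}(s_t, a)}{\beta} \right)
```

Optimizing a consistency loss of this kind is what lets a policy and a value function be trained together from offline trajectories, rather than from pairwise preferences.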
Technical details and benefits
OREO's main innovation lies in optimizing the soft Bellman equation to train the policy and value models simultaneously. This ensures accurate credit assignment across all reasoning steps, addressing the limitations of methods such as DPO. In addition, OREO offers both step-level and response-level objectives, providing flexibility for different granularities of reasoning tasks. At test time, the learned value function supports search techniques such as beam search, which improves accuracy. Unlike baseline methods such as supervised fine-tuning (SFT) or rejection sampling, OREO can leverage failed trajectories, improving model robustness and adaptability. This ability to learn from failure makes it particularly valuable for iterative, multi-step reasoning tasks.
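The sketch below illustrates how a learned value function can guide beam search over reasoning steps at test time. It is a minimal, illustrative implementation: `propose_steps` and `value_fn` are hypothetical stand-ins for the policy's step proposals and OREO's learned value function, not the paper's actual API, and the `[ANSWER]` terminator is an assumed convention.

```python
# Sketch of value-guided beam search at test time (illustrative only).
# `propose_steps(question, chain, k)` and `value_fn(question, chain)` are
# hypothetical stand-ins, not the paper's API.

def value_guided_beam_search(question, propose_steps, value_fn,
                             beam_width=4, max_steps=8):
    """Keep the `beam_width` partial reasoning chains with the highest value."""
    beams = [[]]  # each beam is a list of reasoning-step strings
    for _ in range(max_steps):
        candidates = []
        for chain in beams:
            # The policy proposes several candidate next steps for this chain.
            for step in propose_steps(question, chain, k=beam_width):
                candidates.append(chain + [step])
        # The learned value function scores each partial trajectory; this is
        # where step-level credit assignment pays off at inference time.
        candidates.sort(key=lambda c: value_fn(question, c), reverse=True)
        beams = candidates[:beam_width]
        # Stop once every surviving chain has emitted a final answer.
        if all(chain and chain[-1].endswith("[ANSWER]") for chain in beams):
            break
    return beams[0]
```

Compared with greedy decoding, this kind of search spends extra compute at inference to recover from locally plausible but globally wrong steps, which is consistent with the test-time gains reported below.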
Results and insights
OREO's performance has been evaluated on benchmarks such as GSM8K and MATH for mathematical reasoning, and ALFWorld for embodied agent control. Key findings include:
- With a 1.5-billion-parameter model, OREO achieved a 5.2% relative accuracy improvement over SFT on GSM8K and a 10.5% relative improvement on MATH.
- It reached 52.5% accuracy on MATH with a 1.5-billion-parameter LLM, without using an augmented problem set.
- On ALFWorld, OREO achieved a 17.7% relative improvement in performance in unseen environments, underscoring its ability to generalize beyond the training data.
Iterative training further amplified OREO's effectiveness, yielding consistent accuracy gains across iterations. While approaches such as rejection sampling showed diminishing returns, OREO continued to improve by incorporating insights from failed attempts. Test-time search guided by OREO's value function delivered up to a 17.9% relative improvement over greedy decoding on the MATH dataset, highlighting its impact on inference quality.
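A minimal sketch of the iterative variant described above is shown here. The helpers `sample_trajectories`, `reward`, and `oreo_update` are hypothetical placeholders for trajectory collection, outcome checking, and the offline policy/value update; the actual objectives and data pipeline are defined in the paper.

```python
# Minimal sketch of iterative offline training (illustrative only).
# All helper callables are hypothetical placeholders, not the paper's code.

def iterative_oreo(policy, value_fn, problems,
                   sample_trajectories, reward, oreo_update,
                   n_iterations=3, samples_per_problem=8):
    dataset = []
    for _ in range(n_iterations):
        # 1) Exploration: sample reasoning trajectories with the current policy.
        for problem in problems:
            for traj in sample_trajectories(policy, problem, n=samples_per_problem):
                # Keep both successful and failed trajectories; unlike rejection
                # sampling, the method can learn from sparse-reward failures.
                dataset.append((problem, traj, reward(problem, traj)))
        # 2) Offline update: jointly refit the policy and value function on the
        #    accumulated data using the soft-Bellman-based objective.
        policy, value_fn = oreo_update(policy, value_fn, dataset)
    return policy, value_fn
```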
Conclusion
OREO provides a practical and effective solution for improving multi-step reasoning in LLMs through offline RL. By addressing the limitations of existing approaches, it offers a scalable method for strengthening reasoning capabilities. Its combination of fine-grained credit assignment, iterative training, and test-time search makes it a versatile tool for complex reasoning challenges. The results demonstrate OREO's potential across domains that require sophisticated problem solving, contributing to the evolution of AI systems capable of deeper reasoning.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a wide audience. The platform has more than 2 million monthly visits, illustrating its popularity among readers.