Large language models (LLMs) have made significant progress across many applications, but they still struggle with complex reasoning tasks. For example, even a capable model such as Mistral-7B achieves only 36.5% accuracy on the GSM8K dataset, despite employing techniques such as Chain-of-Thought (CoT). While fine-tuning has shown promise in improving reasoning capabilities, most LLMs rely on data distilled or synthesized by stronger models such as GPT-4. This dependence on more advanced models has led researchers to explore ways to improve reasoning without relying on a superior teacher LLM. The endeavor is especially challenging for smaller language models (SLMs), which struggle with effective exploration of the solution space and with assessing the quality of intermediate reasoning steps.
Researchers have made several attempts to improve the reasoning capabilities of language models. Prompt-based methods such as Chain-of-Thought focus on designing instructions and sequences that improve performance at inference time; these approaches include planning, problem decomposition, abstraction, and scheduling techniques. In addition, self-improvement methods have gained ground, with fine-tuning approaches that use a pre-trained LLM to synthesize data and progressively improve its own performance. Advanced prompting techniques such as self-verification and RAP aim to improve performance through iterative self-exploration. Sampling diverse reasoning paths has shown promise in mathematical reasoning tasks, with methods such as self-consistency and tree-search approaches breaking tasks down into simpler steps. For answer verification, majority voting is widely used, while some researchers have explored training reward or value models, although these require additional annotations and run the risk of overfitting.
Researchers from Microsoft Research Asia and Harvard University presented rStar, a Self-play muTuAl Reasoning approach: a robust method for improving the reasoning capabilities of SLMs at inference time, without fine-tuning or help from superior models. rStar addresses the challenges faced by SLMs through a self-play mutual generation-discrimination process. The method employs conventional Monte Carlo tree search (MCTS) to roll out self-generated reasoning steps, but extends the set of reasoning actions to simulate human reasoning behaviors. These actions include decomposing the problem, proposing specific reasoning steps, posing new sub-questions, and rephrasing the given question. To guide exploration toward effective reasoning trajectories, rStar introduces a discrimination process called mutual consistency, which employs a second SLM as a discriminator to provide unsupervised feedback on candidate reasoning trajectories.
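The mutual-consistency idea can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the `discriminator` callable stands in for the second SLM, the mask point is a fixed prefix length, and an exact string match stands in for the paper's answer-consistency comparison.

```python
from typing import Callable, List

def mutually_consistent(trajectory: List[str],
                        discriminator: Callable[[List[str]], List[str]],
                        mask_from: int = 1) -> bool:
    """Mask the tail of a candidate reasoning trajectory, ask a second
    model to complete it from the unmasked prefix, and accept the
    trajectory only if the completion agrees with the masked part."""
    prefix, masked = trajectory[:mask_from], trajectory[mask_from:]
    completion = discriminator(prefix)
    return completion == masked

# Toy stand-in for the discriminating SLM: it always "reasons" the same way.
fake_slm = lambda prefix: ["step 2: 3 * 4 = 12", "answer: 12"]

trajectory = ["step 1: read the problem", "step 2: 3 * 4 = 12", "answer: 12"]
ok = mutually_consistent(trajectory, fake_slm)
```

The intuition is that two models are unlikely to agree on the same multi-step completion by accident, so agreement is weak unsupervised evidence that the trajectory is sound.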
The rStar method employs a unique architecture to enhance the reasoning capabilities of SLMs. At its core, rStar uses an MCTS algorithm to augment the target SLM and generate multi-step reasoning solutions. The method introduces a comprehensive set of five human-like reasoning actions, including proposing one-step thoughts, generating the remaining thought steps, proposing and answering sub-questions, re-answering sub-questions, and rephrasing questions. This diverse action space enables comprehensive exploration across various reasoning tasks.
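The five-action space described above can be sketched as a simple enumeration. The names below are hypothetical labels chosen for illustration (the paper's own identifiers may differ), and the root-only constraint on rephrasing is a simplifying assumption about when each action is applicable.

```python
from enum import Enum

class Action(Enum):
    """Hypothetical labels for rStar's five human-like reasoning actions."""
    PROPOSE_ONE_STEP = "propose a single next reasoning step"
    COMPLETE_REMAINING = "generate all remaining reasoning steps"
    PROPOSE_SUBQUESTION = "pose and answer a new sub-question"
    REANSWER_SUBQUESTION = "re-answer a previously posed sub-question"
    REPHRASE_QUESTION = "restate the given question"

def legal_actions(is_root: bool) -> list:
    """Illustrative constraint: rephrasing only makes sense at the root
    of the search tree, before any reasoning steps have been committed."""
    if is_root:
        return list(Action)
    return [a for a in Action if a is not Action.REPHRASE_QUESTION]
```

During tree expansion, each node would sample from `legal_actions(...)` and prompt the target SLM with the corresponding instruction, so the search explores qualitatively different reasoning moves rather than only next-token continuations.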
rStar implements a carefully designed reward function that evaluates the value of each action without relying on self-reward techniques or external supervision. The MCTS implementation process uses the Upper Confidence Bounds on Trees (UCT) algorithm to balance exploration and exploitation during tree expansion. To verify the generated reasoning trajectories, rStar introduces a second SLM as a discriminator, employing a mutual consistency approach. This process involves masking part of a candidate trajectory and asking the discriminating SLM to complete it, then comparing the results for consistency.
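The UCT rule mentioned above has a standard closed form. The sketch below is generic UCT with toy child statistics, not rStar's implementation; the exploration constant `c` and the example numbers are illustrative assumptions.

```python
import math

def uct_score(q: float, n_action: int, n_parent: int, c: float = 1.4) -> float:
    """UCT balances average reward (exploitation) against how rarely an
    action has been tried (exploration). Unvisited actions get priority."""
    if n_action == 0:
        return float("inf")
    return q / n_action + c * math.sqrt(math.log(n_parent) / n_action)

# Selection step: each child stores (total reward q, visit count n).
children = {"decompose": (3.0, 5), "one_step": (1.0, 2), "rephrase": (0.0, 0)}
n_parent = sum(n for _, n in children.values())
best = max(children, key=lambda a: uct_score(*children[a], n_parent))
# "rephrase" wins here because unvisited children are always tried first.
```

Note how the bonus term shrinks as an action accumulates visits, so the search gradually concentrates rollouts on the reasoning moves that have paid off.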
The results demonstrate the effectiveness of rStar on several reasoning benchmarks and language models:
1. Performance in various reasoning tasks:
- rStar significantly improved the problem-solving capability of SLMs. For example, the accuracy of LLaMA2-7B on GSM8K increased from 12.51% with few-shot CoT to 63.91% with rStar, almost matching fine-tuned performance.
- rStar consistently improved reasoning accuracy across different SLMs and tasks to state-of-the-art levels, outperforming other baseline approaches.
- Even without the discriminator, the rStar generator outperformed existing multi-round inference baselines such as RAP, ToT, and self-consistency on GSM8K.
2. Efficiency:
- rStar showed significant improvements in reasoning accuracy with only two rollouts on the GSM8K dataset.
3. Performance on challenging mathematical datasets:
- On GSM-Hard and MATH-500, rStar significantly improved the reasoning accuracy of SLMs, with improvements of up to 12.9% and 9.14% respectively compared to the state-of-the-art baselines.
4. Ablation studies:
- The MCTS generator in rStar outperformed other approaches such as RAP and self-consistency on different models and tasks.
- The rStar discriminator consistently outperformed other verification methods, including majority voting and self-verification, on different generators.
5. Model comparisons:
- Different models were tested as discriminators, and GPT-4 achieved the highest accuracy (92.57%) on GSM8K, followed by Phi3-Mini-Instruct (91.13%).
These results highlight the effectiveness of rStar in improving the reasoning capabilities of SLMs on various tasks and models, outperforming existing methods in both accuracy and efficiency.
The rStar approach introduces a robust generator-discriminator self-play method that significantly improves the reasoning capabilities of language models during inference. This research reveals that language models such as LLaMA2-7B possess strong inherent reasoning capabilities even before domain-specific supervised fine-tuning. rStar demonstrates state-of-the-art performance across five language models and five diverse reasoning tasks, substantially outperforming existing multi-round prompting and self-improvement techniques. The extensive ablation studies and analysis performed in this research contribute valuable insights to the field, paving the way for more advanced self-improving reasoning techniques in language models. These findings highlight the potential of rStar to unlock the latent reasoning capabilities of language models without extensive fine-tuning or reliance on larger models.
Take a look at the paper. All credit for this research goes to the researchers of this project.
Asjad is a consulting intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.