Natural language processing (NLP) has seen rapid advances, and large language models (LLMs) are now used to tackle a wide range of challenging problems. Among these applications, mathematical problem solving has become a standard benchmark for evaluating their reasoning abilities. These models have demonstrated notable performance on math-specific benchmarks such as GSM8K, which measures their ability to solve grade-school math problems. However, there is an ongoing debate about whether these models actually understand mathematical concepts or merely exploit patterns in their training data to produce correct answers. This has created a need for further assessment of how well their reasoning extends to complex, interconnected problems.
Despite their success on existing mathematical benchmarks, the researchers identified a critical problem: most LLMs fail to exhibit consistent reasoning when faced with more complex compositional questions. While standard benchmarks involve solving individual problems independently, real-world scenarios often require understanding the relationships between multiple problems, where the answer to one question must be used to solve another. Traditional evaluations, which focus only on solving isolated problems, do not adequately represent these scenarios. This creates a discrepancy between high benchmark scores and the practical usability of LLMs for complex tasks that require step-by-step reasoning and deeper understanding.
Researchers from Mila, Google DeepMind, and Microsoft Research have introduced a new evaluation called Compositional Grade-School Math (Compositional GSM). This method chains together two separate math problems so that the solution to the first problem becomes a variable in the second. Using this approach, researchers can analyze the ability of LLMs to handle dependencies between questions, a capability that is not adequately captured by existing benchmarks. Compositional GSM offers a more complete assessment of LLMs' reasoning by introducing linked problems that require the model to carry information from one problem to the next, so that both must be solved correctly for a successful outcome.
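To make the chaining concrete, here is a minimal sketch of how such a pair of problems might be constructed and graded; the questions, the variable name X, and the prompt layout are illustrative assumptions rather than the authors' actual templates.

```python
# Minimal sketch (not the authors' exact format) of a compositional GSM item:
# the answer to Question 1 is referenced as the variable X inside Question 2.

q1 = ("A bakery bakes 5 trays with 12 muffins on each tray. "
      "How many muffins does it bake in total?")            # gold answer: 60

q2 = ("A school buys X muffins and shares them equally among 6 classes. "
      "How many muffins does each class get?")              # gold answer: 10

prompt = (
    "Solve the following two questions. The value X in Question 2 is the "
    "answer to Question 1.\n"
    f"Question 1: {q1}\n"
    f"Question 2: {q2}\n"
)

# Grading: the item counts as correct only if the model's final answer to
# Question 2 matches the value obtained by substituting X = 60.
gold_q1 = 5 * 12
gold_final = gold_q1 // 6
print(prompt)
print("Expected final answer:", gold_final)
```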
The evaluation was carried out with a variety of LLMs, including open-weight models such as LLAMA3 and closed-weight models from the GPT and Gemini families. The study used three test sets: the original GSM8K test split, a modified version of GSM8K in which some variables were substituted, and the new Compositional GSM test set, each containing 1,200 examples. The models were tested with 8-shot prompting, where they were given several worked examples before being asked to solve the compositional problems. This setup allowed the researchers to compare models holistically, considering their ability to solve problems both individually and in a compositional context.
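As a rough illustration of this protocol, the sketch below scores a model under k-shot prompting. The prompt layout, the answer-extraction heuristic, and the callable `model` interface are assumptions made for this example, not the paper's evaluation code.

```python
import re

def build_few_shot_prompt(demos, question, k=8):
    """Prepend k worked demonstrations before the test question."""
    parts = [f"Question: {d['question']}\nAnswer: {d['solution']}\n" for d in demos[:k]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

def extract_final_answer(text):
    """Take the last integer in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d+", text.replace(",", ""))
    return int(numbers[-1]) if numbers else None

def evaluate(model, demos, test_set, k=8):
    """Accuracy of `model` (a prompt -> completion callable) under k-shot prompting."""
    correct = 0
    for item in test_set:
        prompt = build_few_shot_prompt(demos, item["question"], k)
        completion = model(prompt)
        correct += extract_final_answer(completion) == item["answer"]
    return correct / len(test_set)
```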
The results showed a considerable gap in reasoning ability. For example, cost-efficient models like GPT-4o mini showed a reasoning gap 2 to 12 times worse on Compositional GSM than on the standard GSM8K. Furthermore, specialized math models such as Qwen2.5-MATH-72B, which achieve more than 80% accuracy on high-school competition-level questions, could solve less than 60% of the grade-school compositional math problems. This substantial drop suggests that specialized math training alone is not enough to prepare models for multi-step reasoning tasks. Moreover, models such as LLAMA3-8B and Mistral-7B, despite achieving high scores on isolated problems, showed a sharp drop when required to link answers between related problems.
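One way to read such a gap, assuming it is measured as the shortfall between the accuracy expected if the two questions were solved independently and the accuracy actually observed on the chained pairs (our reading of the setup, with made-up numbers, not figures quoted from the paper):

```python
# Illustrative reasoning-gap calculation under the assumption stated above.
acc_q1 = 0.90            # accuracy on the first questions in isolation
acc_q2 = 0.88            # accuracy on the second questions in isolation
acc_compositional = 0.62 # observed accuracy on the chained pairs

expected = acc_q1 * acc_q2            # 0.792 if errors were independent
reasoning_gap = expected - acc_compositional
print(f"expected: {expected:.3f}, gap: {reasoning_gap:.3f}")  # gap = 0.172
```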
The researchers also explored the impact of instruction tuning and code generation on model performance. Instruction tuning improved results for smaller models on standard GSM8K problems but led to only minor improvements on Compositional GSM. Meanwhile, generating code solutions instead of reasoning in natural language yielded relative improvements of 71% to 149% on Compositional GSM for some smaller models. This finding indicates that while code generation helps narrow the reasoning gap, it does not eliminate it, and systematic differences in reasoning ability between models persist.
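To illustrate what a code-format solution might look like for a chained item such as the muffin example above, here is a hedged sketch in the general program-aided style; the exact output format the authors used may differ.

```python
# Instead of free-form text, the model emits a short program whose printed
# result is the final answer; the chained dependency becomes an explicit variable.
def solve():
    # Question 1: 5 trays with 12 muffins each.
    muffins_total = 5 * 12          # X = 60
    # Question 2: X muffins shared equally among 6 classes.
    per_class = muffins_total // 6  # 10
    return per_class

print(solve())  # 10
```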
Analysis of the reasoning gaps revealed that the performance drop was not due to test-set leakage, but rather to distraction from the additional context and to poor reasoning on the second hop. For example, when models like LLAMA3-70B-IT and Gemini 1.5 Pro needed to solve the second question using the answer to the first, they often failed to apply that answer correctly, resulting in incorrect final answers. This phenomenon, known as the second-hop reasoning gap, was more pronounced in smaller models, which tended to miss crucial details when solving the chained problems.
The study highlights that current LLMs, regardless of their performance on standard benchmarks, still struggle with compositional reasoning tasks. The Compositional GSM benchmark introduced in this research provides a valuable tool for assessing LLMs' reasoning abilities beyond isolated problem solving. The results suggest that more robust training strategies and benchmark designs are needed to improve the compositional capabilities of these models, allowing them to perform better in complex problem-solving scenarios. The research underscores the importance of reevaluating existing evaluation methods and prioritizing the development of models capable of multi-step reasoning.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.