Recent advances in LLMs have significantly improved their reasoning capabilities, particularly through RL-based fine-tuning. Initially trained with supervised learning for next-token prediction, these models undergo RL post-training, exploring multiple reasoning paths to reach correct answers, much like an agent navigating a game. This process gives rise to emergent behaviors such as self-correction, often called the "aha moment," where models begin to revise their own mistakes without explicit instruction. While this improves accuracy, it also produces much longer responses, increasing token usage, computational cost, and latency. Despite the assumption that longer outputs equal better reasoning, research shows mixed results: some gains are observed, but excessively long responses can also hurt performance, indicating diminishing returns.
To address this, researchers are exploring ways to balance reasoning quality and efficiency. Approaches include using smaller, faster models, applying prompt engineering to reduce verbosity, and developing reward-shaping techniques that encourage concise yet effective reasoning. One notable approach is long-to-short distillation, where models learn from detailed explanations but are trained to produce shorter yet accurate answers. Using these techniques, models such as Kimi have demonstrated competitive performance against larger models like GPT-4 while consuming fewer tokens. Studies also highlight the concept of "token complexity," which shows that each problem requires a minimum token threshold for accurate resolution, and prompting strategies aimed at conciseness often fall short of that optimal point. Overall, the findings emphasize the importance of developing more efficient reasoning methods without compromising performance.
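To make the reward-shaping idea concrete, here is a hypothetical sketch, not a method from any specific work cited here, of a reward that combines correctness with a mild length penalty so that short correct answers score highest; the `target_tokens` budget and `penalty_weight` values are illustrative assumptions:

```python
def concise_reward(is_correct: bool, num_tokens: int,
                   target_tokens: int = 512, penalty_weight: float = 0.5) -> float:
    """Hypothetical conciseness-aware reward: correct answers earn more the
    closer they stay to a token budget; wrong answers are penalized regardless."""
    if not is_correct:
        return -1.0
    # Penalize only tokens beyond the budget, capped so correct answers stay positive.
    overshoot = max(0, num_tokens - target_tokens) / target_tokens
    return 1.0 - penalty_weight * min(overshoot, 1.0)

print(concise_reward(True, 300))    # 1.0  -- under budget, full reward
print(concise_reward(True, 1024))   # 0.5  -- twice the budget, part of the bonus lost
print(concise_reward(False, 100))   # -1.0 -- incorrect, length irrelevant
```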
Researchers at Wand AI challenge the belief that longer responses inherently lead to better reasoning in large language models. Through theoretical analysis and experiments, they show that verbosity is a byproduct of RL optimization rather than a requirement for accuracy. Interestingly, concise responses often correlate with higher correctness, and correct answers tend to be shorter than incorrect ones. They propose a two-phase RL training approach: the first phase improves reasoning ability, while the second enforces conciseness using a small dataset. This method reduces response length without sacrificing accuracy, offering better efficiency and performance at minimal computational cost.
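The article does not include reference code, so the following is only a schematic sketch of how such a two-phase schedule could be wired up; `ppo_update`, `sample_response`, and `is_correct` are hypothetical placeholders for a real PPO update step, a model rollout, and an answer checker:

```python
import random

def sample_batch(problems, size=8):
    # Draw a small batch of problems for one RL step.
    return random.sample(problems, min(size, len(problems)))

def reward(response: str, problem: dict) -> float:
    # Correctness-only reward in BOTH phases: no explicit length penalty.
    return 1.0 if is_correct(response, problem) else -1.0

def train_two_phase(model, hard_problems, solvable_problems,
                    phase1_steps=1000, phase2_steps=200):
    # Phase 1: strengthen reasoning on challenging problems. Frequent negative
    # rewards here tend to push responses longer.
    for _ in range(phase1_steps):
        batch = sample_batch(hard_problems)
        rollouts = [sample_response(model, p) for p in batch]
        ppo_update(model, rollouts, [reward(r, p) for r, p in zip(rollouts, batch)])

    # Phase 2: enforce conciseness with a small set of problems the model can
    # usually solve; mostly positive rewards nudge responses shorter
    # without hurting accuracy.
    for _ in range(phase2_steps):
        batch = sample_batch(solvable_problems)
        rollouts = [sample_response(model, p) for p in batch]
        ppo_update(model, rollouts, [reward(r, p) for r, p in zip(rollouts, batch)])

    return model
```

Note that the reward stays correctness-only in both phases; conciseness comes from switching to problems the model can usually solve, not from an explicit length penalty.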
Longer answers do not always lead to better performance in language models. RL post-training tends to reduce response length while maintaining or improving accuracy, especially early in training. This counters the belief that long reasoning chains are necessary for correctness. The phenomenon is linked to dead ends, where excessively long outputs risk drifting off course. Analyzing language tasks as Markov decision processes reveals that RL minimizes loss, not length, and longer outputs arise only when rewards are consistently negative. A two-phase RL strategy, first on hard problems and then on solvable ones, can promote reasoning while ultimately encouraging conciseness and robustness.
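As a rough back-of-the-envelope illustration of that last point (a toy example, not the paper's derivation): if a single terminal reward is spread across all generated tokens, a negative reward gets diluted as the response grows, so longer outputs look less costly to the optimizer, while a positive reward shrinks per token with length, nudging correct answers toward brevity:

```python
# Toy numbers: spread one terminal reward evenly over T generated tokens
# as a crude stand-in for credit assignment.
def per_token_signal(terminal_reward: float, num_tokens: int) -> float:
    return terminal_reward / num_tokens

for r in (-1.0, 1.0):
    for tokens in (50, 200, 800):
        print(f"reward={r:+.0f}  tokens={tokens:>3}  per-token={per_token_signal(r, tokens):+.5f}")

# With reward -1 the per-token penalty shrinks from -0.02000 to -0.00125 as the
# response grows, so length "hides" failure; with reward +1 the per-token gain
# shrinks with length, so correct answers are nudged toward brevity.
```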
The two-phase RL strategy led to notable performance gains across different model sizes. Training at different difficulty levels showed that easier problems helped models shorten their answers while maintaining or improving accuracy. A second RL phase using only eight math problems produced more concise and robust results on benchmarks such as AIME, AMC, and MATH-500, with similar trends on MMLU-STEM tasks. Even minimal RL post-training improved accuracy and stability under low-temperature sampling. In addition, models without prior RL training, such as Qwen-Math-v2.5, showed large accuracy gains of up to 30% after training on just four math problems.
In conclusion, the study presents a two-phase RL post-training method that improves both reasoning and conciseness in language models. The first phase improves accuracy, while the second focuses on shortening answers without sacrificing performance. Applied to R1 models, this approach reduced response length by over 40% while maintaining accuracy, especially at low temperatures. The findings show that longer answers are not inherently better and that targeted RL can achieve concise reasoning. The study also emphasizes that even minimal RL training can greatly benefit non-reasoning models, underscoring the value of including moderately solvable problems and carefully tuning PPO parameters.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.