One persistent focus in large language model research has been improving logical reasoning and problem-solving skills. Reinforcement learning (RL) is increasingly applied in this space, both to massive models and to compact versions that must run well in constrained compute environments. A central challenge in this field is improving a model's reasoning capacity without depending on extremely large infrastructure or excessive training time. The leading models require expensive hardware and proprietary data pipelines, which puts them out of reach of smaller labs and companies. This raises the question of whether smaller models can be improved with cost-effective approaches and achieve performance comparable to their larger counterparts on challenging tasks such as mathematical reasoning.
Several methods have been explored to address this. Chain-of-thought prompting helps guide models through problem-solving steps. Search algorithms, such as beam search and Monte Carlo Tree Search, are also used to improve the logical flow of answers. Reinforcement learning itself has been tested in multiple settings. However, many of these approaches remain constrained by the same problems: they depend on massive datasets or yield unstable performance in small-scale configurations. Moreover, the results often fall short of proprietary models such as OpenAI's o1-preview.
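For context, chain-of-thought prompting works by showing the model a worked example so it produces intermediate reasoning before the final answer. A minimal sketch follows; the few-shot example and wording are illustrative, not taken from the paper:

```python
# Illustrative chain-of-thought prompt: the worked example and the
# "Let's think step by step" cue steer the model toward showing its work.
COT_PROMPT = """\
Q: A train travels 120 km in 2 hours. What is its average speed?
A: Let's think step by step. Speed = distance / time = 120 km / 2 h = 60 km/h.
The answer is 60 km/h.

Q: {question}
A: Let's think step by step."""

def build_cot_prompt(question: str) -> str:
    """Return a chain-of-thought prompt for an arbitrary math question."""
    return COT_PROMPT.format(question=question)

if __name__ == "__main__":
    print(build_cot_prompt("If 3x + 5 = 20, what is x?"))
```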
The research, introduced by a team from Knovel Engineering Lab in Singapore and the VNU University of Science in Vietnam, focused on overcoming these problems. The researchers used a 1.5-billion-parameter model called DeepSeek-R1-Distill-Qwen-1.5B. They adopted the Group Relative Policy Optimization (GRPO) algorithm for training, using four NVIDIA A40 GPUs with 48 GB of VRAM each, all within a strict 24-hour limit. Their key objective was to improve the model's reasoning without a large financial or computational investment. The training run consumed only $42 in compute costs, a drastic reduction compared to baselines that require thousands of dollars.
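The core idea that lets GRPO skip a learned critic is group-relative scoring: several completions are sampled per prompt and each one's advantage is computed against its own group's statistics. A minimal sketch of that normalization step, under the assumption of simple mean/std standardization (the paper's full loss, clipping, and KL terms are not shown):

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: each sampled completion is scored against
    the mean and std of its own group, so no separate critic (value model)
    needs to be trained or held in GPU memory."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rule-based rewards for 4 completions sampled from one prompt.
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
```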
To achieve this, the team assembled a dataset of 39,659 mathematics-focused questions by refining two existing datasets, Open-S1 and Open-DeepScaleR. The filtering process removed trivial or noisy questions using models such as Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-1.5B. The reward system was rule-based and focused on three components: correctness of the answer (checked via boxed notation), structural format (enforced with tags), and output length (shaped with a cosine function to promote concise reasoning). The GRPO algorithm samples groups of responses and applies score-based optimization, avoiding the need for a critic model and thereby further reducing computational demands.
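To make the three reward components concrete, here is a minimal sketch of a rule-based reward of this shape. The weights, tag names, and cosine length bounds are hypothetical (the bounds echo the 1,000–3,500-token range reported later); the paper's exact formulation may differ:

```python
import math
import re

# Hypothetical length bounds for the cosine shaping term (illustrative).
MIN_LEN, MAX_LEN = 1000, 3500

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the final \\boxed{...} answer matches the reference, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus when reasoning is wrapped in <think>...</think> tags
    (tag convention assumed here for illustration)."""
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def cosine_length_reward(num_tokens: int) -> float:
    """Cosine-shaped term: 1.0 at MIN_LEN, decaying to 0.0 at MAX_LEN,
    nudging the model toward concise reasoning."""
    t = min(max(num_tokens, MIN_LEN), MAX_LEN)
    frac = (t - MIN_LEN) / (MAX_LEN - MIN_LEN)
    return 0.5 * (1.0 + math.cos(math.pi * frac))

def total_reward(completion: str, gold: str, num_tokens: int) -> float:
    # Hypothetical weighting of the three components.
    return (accuracy_reward(completion, gold)
            + format_reward(completion)
            + 0.25 * cosine_length_reward(num_tokens))
```

Because every component is a deterministic rule, scoring a group of sampled completions is cheap, which is what keeps the whole GRPO loop within the reported hardware and cost budget.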

The performance of this approach was tested on five benchmark datasets: AMC23, AIME24, MATH-500, OlympiadBench, and Minerva. In one experiment using only the Open-S1 dataset, the model's AMC23 accuracy improved from 63% to 70% within the first 100 global steps, but then declined. In another run combining 7,000 samples of mixed difficulty, accuracy on AMC23 rose to 80% and AIME24 reached 46.7%. The model trained in that configuration, called Open-RS2, also posted competitive scores on OlympiadBench (52.4%) and MATH-500 (85%). In the final experiment, the cosine reward helped regulate output length to a range of 1,000–3,500 tokens, and the model maintained an accuracy of 72.5% on AMC23 and 84.4% on MATH-500.
This research shows that effective reasoning in small language models can be achieved even with limited resources. The problem of training small models without significant hardware investment was addressed with an efficient, low-cost training strategy. The proposed method used reinforcement learning and carefully curated data to deliver surprisingly strong results. With continued improvements in reward design and optimization stability, small models may soon rival their larger counterparts in practical reasoning tasks.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter (https://x.com/intent/follow?screen_name=marktechpost) and don't forget to join our 85k+ ML SubReddit.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.