Language models (LMs) have progressed significantly through increases in computational power during training, mainly via large-scale self-supervised pretraining. While this approach has produced powerful models, a new paradigm called test-time scaling has emerged, which focuses on improving performance by increasing computation at inference time. OpenAI's o1 model has validated this approach, showing improved reasoning capabilities through scaling test-time compute. However, replicating these results has proven challenging, with attempts relying on techniques such as Monte Carlo Tree Search (MCTS), multi-agent approaches, and reinforcement learning. Even models such as DeepSeek R1 have used millions of training samples and complex training stages, yet none has replicated the test-time scaling behavior seen in o1.
Several methods have been developed to address the challenge of test-time scaling. Sequential scaling approaches let models generate successive solution attempts, with each iteration building on previous results. Tree-based search methods combine sequential and parallel scaling, implementing techniques such as MCTS and guided beam search. REBASE has emerged as a notable approach, using a process reward model to optimize tree search through balanced exploitation and pruning, showing superior performance compared to sampling-based methods and MCTS. These approaches rely heavily on reward models, which come in two forms: outcome reward models, which evaluate complete solutions in best-of-N selection, and process reward models, which evaluate individual reasoning steps in tree-based search methods.
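To make the role of an outcome reward model concrete, here is a minimal sketch of best-of-N selection: sample several complete solutions and keep the one the reward model scores highest. The `generate` and `score_solution` callables are hypothetical placeholders, not the interfaces of any specific system discussed above.

```python
# Minimal sketch of best-of-N selection with an outcome reward model (ORM).
# The generation and scoring functions are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str
    score: float

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score_solution: Callable[[str, str], float],
              n: int = 8) -> Candidate:
    """Sample n complete solutions and keep the one the ORM rates highest."""
    candidates: List[Candidate] = []
    for _ in range(n):
        solution = generate(prompt)                # one full solution attempt
        reward = score_solution(prompt, solution)  # ORM scores the finished answer
        candidates.append(Candidate(solution, reward))
    return max(candidates, key=lambda c: c.score)
```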
Researchers at Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI have proposed a simplified approach to achieving test-time scaling and improved reasoning capabilities. Their method centers on two key innovations: the carefully curated s1K dataset of 1,000 questions paired with reasoning traces, selected according to difficulty, diversity, and quality criteria, and a novel technique called budget forcing. This budget-forcing mechanism controls test-time computation by cutting short or extending the model's thinking process through strategic "Wait" insertions, which allows the model to review and correct its reasoning. The approach was implemented by fine-tuning the Qwen2.5-32B-Instruct language model on the s1K dataset.
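The following is a hedged sketch of how budget forcing can work at decode time. The `generate_until` helper and the end-of-thinking delimiter string are hypothetical placeholders; the actual s1 implementation applies the same idea inside its own inference stack.

```python
# Illustrative sketch of budget forcing; model.generate_until and the
# end-of-thinking delimiter are assumed interfaces, not the paper's exact code.
def budget_forced_generation(model, prompt, max_extensions=2,
                             end_of_thinking="<|end_of_thinking|>"):
    """Suppress the end of the thinking phase up to max_extensions times,
    appending "Wait" so the model revisits and possibly corrects its reasoning."""
    trace = model.generate_until(prompt, stop=end_of_thinking)
    for _ in range(max_extensions):
        # Instead of letting the model stop thinking, append "Wait" and keep decoding.
        trace += "\nWait"
        trace += model.generate_until(prompt + trace, stop=end_of_thinking)
    # Once the thinking budget is exhausted, emit the delimiter and decode the answer.
    answer = model.generate_until(prompt + trace + end_of_thinking, stop=None)
    return trace, answer
```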
The data selection process follows a three-stage filtering approach based on quality, difficulty, and diversity criteria. The quality-filtering stage begins by removing samples with API errors and formatting problems, reducing the initial dataset to 51,581 examples, from which 384 high-quality samples are initially selected. The difficulty assessment uses two key metrics: model performance, evaluated with the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models and with correctness verified by Claude 3.5 Sonnet, and reasoning-trace length, measured with the Qwen2.5 tokenizer. For diversity, questions are classified into specific domains using the Mathematics Subject Classification system via Claude 3.5 Sonnet. This comprehensive filtering process yields a final dataset of 1,000 samples covering 50 domains.
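Below is an illustrative sketch of this three-stage pipeline under assumed metadata fields (e.g. `api_error`, `solved_by_7b`, `domain`, `trace_length`); the actual s1K pipeline relies on its own annotations and the Claude 3.5 Sonnet grader.

```python
# Sketch of quality -> difficulty -> diversity filtering with assumed fields.
import random
from collections import defaultdict

def filter_quality(examples):
    # Stage 1: drop samples with API errors or formatting problems.
    return [e for e in examples if not e["api_error"] and e["well_formatted"]]

def filter_difficulty(examples):
    # Stage 2: discard questions that both reference models already solve,
    # since easy questions carry little reasoning signal.
    return [e for e in examples
            if not (e["solved_by_7b"] and e["solved_by_32b"])]

def sample_diverse(examples, k=1000):
    # Stage 3: group by domain, then repeatedly pick a random domain and take
    # its longest remaining reasoning trace until k samples are collected.
    by_domain = defaultdict(list)
    for e in examples:
        by_domain[e["domain"]].append(e)
    for pool in by_domain.values():
        pool.sort(key=lambda e: e["trace_length"], reverse=True)
    selected, domains = [], list(by_domain)
    while len(selected) < k and domains:
        d = random.choice(domains)
        selected.append(by_domain[d].pop(0))
        if not by_domain[d]:
            domains.remove(d)
    return selected
```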
The s1-32B model demonstrates significant performance improvements from scaling test-time compute with budget forcing. s1-32B operates in a superior scaling paradigm compared to the base Qwen2.5-32B-Instruct model using majority voting, validating the effectiveness of sequential scaling over parallel approaches. In addition, s1-32B emerges as the most sample-efficient open-data reasoning model, showing a marked improvement over the base model with only 1,000 additional training samples. While r1-32B achieves better performance, it requires 800 times more training data. Notably, s1-32B approaches the performance of Gemini 2.0 Flash Thinking on AIME24, which suggests successful knowledge distillation.
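For contrast with sequential scaling, the parallel baseline mentioned above is plain majority voting. A minimal sketch, assuming a hypothetical `generate` function and an `extract_answer` parser:

```python
# Majority-voting baseline: sample n independent answers, return the most common.
from collections import Counter

def majority_vote(prompt, generate, extract_answer, n=64):
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```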
This work shows that supervised fine-tuning (SFT) with only 1,000 carefully selected examples can create a competitive reasoning model that matches o1-preview performance with strong sample efficiency. The introduced budget-forcing technique, when combined with the fine-tuned reasoning model, successfully reproduces OpenAI's test-time scaling behavior. The effectiveness of such minimal training data suggests that the model's reasoning capabilities are largely acquired during pretraining on billions of tokens, with the fine-tuning process simply activating these latent skills. This aligns with the "Superficial Alignment Hypothesis" from the LIMA research, which suggests that a relatively small number of examples can effectively align a model's behavior with the desired outcomes.
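As a rough picture of what such a small-scale SFT run can look like, here is a hedged sketch using the TRL library; the dataset identifier, model string, and hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of supervised fine-tuning on ~1,000 reasoning examples.
# "simplescaling/s1K" and all hyperparameters below are assumed values.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("simplescaling/s1K", split="train")  # assumed dataset id

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",   # base instruct model being fine-tuned
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="s1-32b-sft",
        num_train_epochs=5,              # assumed; small datasets often need several epochs
        per_device_train_batch_size=1,
        bf16=True,
    ),
)
trainer.train()
```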
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

Sajad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.