Restricted access to high-quality reasoning datasets has limited progress on logical and mathematical reasoning in open-source AI. Although proprietary models have leveraged structured reasoning demonstrations to improve performance, these datasets and methodologies remain closed, restricting independent research and innovation. The lack of open, scalable reasoning datasets has created a bottleneck for AI development.
In recent years, models such as Sky-T1, STILL-2, and DeepSeek-R1 have shown that a relatively small set of high-quality reasoning demonstrations, numbering in the hundreds of thousands, can substantially improve a model's ability to perform complex logical and mathematical reasoning tasks. Even so, most reasoning datasets, and the methodologies behind their creation, remain proprietary, limiting access to resources that are crucial for further exploration of the field.
The Open Thoughts initiative, led by Bespoke Labs and the DataComp community from Stanford, UC Berkeley, UT Austin, UW, UCLA, UNC, TRI, and LAION, is an ambitious open-source project that aims to curate and develop high-quality reasoning datasets, addressing the concerns above about dataset availability. The project seeks to establish the best open reasoning datasets for improving the cognitive capabilities of language models, and the team intends to make both the datasets and the data generation strategies publicly available. As part of this effort, they have released the OpenThoughts-114K reasoning dataset and the associated OpenThinker-7B model. Let's look at each in turn.
The OpenThoughts-114K dataset: a new standard in open reasoning data
This dataset was designed to provide a large-scale, high-quality corpus of reasoning demonstrations to improve the reasoning skills of language models. OpenThoughts-114K is an extension of earlier datasets such as Bespoke-Stratos-17K, which contained only 17,000 examples. By scaling up to 114,000 reasoning examples, this dataset has improved performance across various reasoning benchmarks. OpenThoughts-114K was generated using reasoning distillation techniques inspired by DeepSeek-R1, which showed that synthetic reasoning demonstrations can be produced efficiently. The dataset covers a range of reasoning challenges, from mathematical problem solving to logical deduction, making it a valuable resource for improving model robustness across multiple reasoning domains.
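To make the distillation idea concrete, here is a minimal sketch of the general pattern: prompt a strong teacher model for a step-by-step solution, then keep the trace only if its final answer matches a known ground truth. The teacher endpoint, model name, prompt format, and verification rule below are illustrative assumptions, not the Open Thoughts project's actual pipeline.

```python
# Hypothetical distillation step: ask a teacher model for a step-by-step
# solution and keep the trace only if its final answer matches the ground
# truth. The model name and prompt format are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible teacher endpoint

def distill_example(question: str, gold_answer: str) -> dict | None:
    response = client.chat.completions.create(
        model="teacher-reasoning-model",  # placeholder, not a real model ID
        messages=[
            {"role": "system",
             "content": ("Reason step by step, then give the final answer "
                         "on a line starting with 'Answer:'.")},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
    )
    trace = response.choices[0].message.content
    # Simple verification: extract the final answer and compare it to gold.
    final = trace.rsplit("Answer:", 1)[-1].strip()
    if final != gold_answer.strip():
        return None  # reject unverified reasoning traces
    return {"question": question, "reasoning": trace, "answer": final}
```

Filtering on answer correctness is only the simplest form of verification; real pipelines typically add deduplication and difficulty filtering on top of it.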
OpenThinker-7B: a model for advanced reasoning
Alongside the release of OpenThoughts-114K, the Open Thoughts team also introduced OpenThinker-7B, a fine-tuned version of Qwen2.5-7B-Instruct. The model was trained specifically on OpenThoughts-114K and improves substantially on its predecessors. It was trained for about 20 hours on four 8xH100 nodes, using the Transformers 4.46.1 library and PyTorch 2.3.0 to ensure compatibility with widely used ML frameworks.
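For a feel of what such a run looks like, here is a minimal supervised fine-tuning sketch with Hugging Face Transformers. The hyperparameters, single-process setup, and chat-formatting step are illustrative assumptions; the team's actual multi-node configuration lives in their published training code.

```python
# A minimal supervised fine-tuning sketch with Hugging Face Transformers.
# Hyperparameters and formatting here are illustrative assumptions, not the
# Open Thoughts training recipe.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")

def to_features(example):
    # Assumes chat-style rows; check the dataset card for the exact schema.
    text = tokenizer.apply_chat_template(example["conversations"],
                                         tokenize=False)
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="openthinker-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=3,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal-LM collator builds labels from input_ids for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```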
On several reasoning tasks, OpenThinker-7B outperforms comparable models such as Bespoke-Stratos-7B, DeepSeek-R1-Distill-Qwen-7B, and even GPT-4o. Benchmarked with the Evalchemy evaluation framework, it demonstrated impressive results: 43.3% on AIME24, 83.0% on MATH500, 42.4% on GPQA-Diamond, 75.3% on LCB Easy, and 28.6% on LCB Medium. These results indicate that OpenThinker-7B is a formidable open-source alternative to proprietary reasoning models.
Fully open source: weights, data, and code
A defining characteristic of the Open Thoughts project is its commitment to full transparency. Unlike proprietary models such as GPT-4o and o1-mini, which keep their data and training methodologies closed, OpenThinker-7B and OpenThoughts-114K are completely open source. This means:
- Open model weights: The OpenThinker-7B weights are publicly accessible, allowing researchers and developers to fine-tune and build on the model (see the loading sketch after this list).
- Open data: The OpenThoughts-114K dataset is freely available for anyone to use, modify, and expand.
- Open source code: The data generation, evaluation, and training code for OpenThinker-7B is hosted on GitHub, ensuring full transparency and reproducibility.
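In practice, both artifacts can be pulled from the Hugging Face Hub in a few lines. The repository IDs below are assumed from the project's public release names and should be verified on the Hub; the rest is standard Transformers/Datasets usage.

```python
# Sketch of loading the released artifacts from the Hugging Face Hub.
# Repository IDs are assumed from the project's public releases; verify
# them on the Hub before use.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Open weights: download and query OpenThinker-7B.
model_id = "open-thoughts/OpenThinker-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "What is the sum of the first 10 positive integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Open data: stream the reasoning dataset without downloading all of it.
dataset = load_dataset("open-thoughts/OpenThoughts-114k",
                       split="train", streaming=True)
print(next(iter(dataset)))  # inspect one reasoning example
```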
The Open Thoughts project is still in its early stages, with plans for further expansion. Some possible future directions include:
- Future dataset iterations could incorporate millions of reasoning examples, covering a broader spectrum of cognitive challenges.
- OpenThinker-7B is an excellent starting point, but larger models fine-tuned on even more data could push the limits of reasoning capabilities further.
- Encouraging more researchers, engineers, and AI enthusiasts to contribute to dataset creation, model training, and evaluation methodologies.
In conclusion, Open Thoughts represents a transformative effort to democratize AI reasoning. By releasing OpenThoughts-114K and OpenThinker-7B as open-source resources, the project empowers the AI community with high-quality data and models to advance reasoning research. With continued collaboration and expansion, Open Thoughts has the potential to redefine how AI approaches logical, mathematical, and cognitive reasoning tasks.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.