Advanced conversational models like ChatGPT and Claude are causing significant changes in various products and everyday life. The key factor behind their success is the strength of the underlying foundation language model. State-of-the-art foundation models are typically pre-trained on large, high-quality datasets spanning diverse sources such as Wikipedia, scientific articles, community forums, GitHub repositories, web pages, and more. These foundation language models are expected to possess comprehensive capabilities, including language understanding, common sense reasoning, mathematical reasoning, and language generation.
A new study by Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Nanjing University of Science and Technology, and Generative AI Research Laboratory (GAIR) focuses on improving the mathematical reasoning capabilities of foundation language models, which could benefit applications such as educational tools, automated problem solving, data analysis, and programming, and ultimately improve the user experience. Instead of directly building a model, the goal is to create a diverse, high-quality pre-training dataset designed specifically for the mathematics domain: MATHPILE.
This approach stands out from previous work in several ways. Previous open source pre-training datasets have generally focused on general domains (e.g., Pile, RedPajama, Dolma), multilingual aspects, or programming languages (e.g., ROOTS and The Stack), and none offers a corpus specifically designed for mathematics. Although some datasets are designed to train math-specific language models (for example, Minerva's math training dataset and OpenAI's MathMix), these are not openly available.
This work aims to close that gap by developing an open source mathematical corpus, democratizing access to high-quality mathematical data. This initiative allows researchers and developers to effectively and inclusively advance the capabilities of language models in mathematical reasoning. In terms of diversity, the corpus goes beyond web pages, integrating top-level mathematics textbooks, lecture notes, scientific articles from arXiv, and carefully curated content from authoritative platforms such as StackExchange, ProofWiki, and Wikipedia. This positions the corpus as a richer and more varied mathematical resource for language models.
The researchers emphasize high quality because recent studies have highlighted the adverse effects of repetitive, low-quality content in pre-training datasets on model training. For example, one code-centric model with 1.3 billion parameters was pre-trained on carefully selected web pages and synthetic textbooks. The researchers stress that the quality of the corpus matters more than its quantity. To achieve this, they carried out extensive preprocessing, cleaning, filtering, and deduplication, and they are committed to continuously refining and optimizing the corpus so that it makes a distinctive contribution to the mathematical domain.
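To make the filtering and deduplication steps more concrete, here is a minimal Python sketch. It is not the MATHPILE pipeline, whose details are not described here. The function names `normalize` and `clean_corpus`, the `min_words` threshold, and the choice of exact-match hashing are all assumptions made for illustration only.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_words=5):
    """Hypothetical cleaning pass: drop very short documents and exact
    duplicates (after normalization), keeping first occurrences in order."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Crude quality filter (assumed threshold): skip near-empty documents.
        if len(norm.split()) < min_words:
            continue
        # Exact deduplication via a hash of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "Let x be a real number such that x > 0.",
        "let  x be a real number such that   x > 0.",  # duplicate after normalization
        "Short.",                                      # too short, filtered out
        "A group is a set with an associative binary operation.",
    ]
    for doc in clean_corpus(sample):
        print(doc)
```

A real corpus would likely also need near-duplicate detection and language or content filters; exact hashing is used here only because it is the simplest technique that illustrates the idea.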
The team highlights transparency and documentation as key aspects. Thoroughly documenting large-scale pre-training datasets is crucial to identifying biases or problematic content. MATHPILE provides comprehensive documentation, including features, intended uses, and efforts to remove bias or unwanted content, to improve trust and usability among practitioners.
This initiative aims to foster the growth of AI in mathematics by offering a specialized, diverse, and high-quality corpus tailored to the mathematical domain, while maintaining full data transparency for practitioners. The team hopes their work will help lay the foundation for training more powerful mathematical problem-solving models in the future.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. Dhanshree is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.