Machine learning models, especially those designed for code generation, rely heavily on high-quality data during pre-training. This field has seen rapid advancement, with large language models (LLMs) trained on extensive datasets containing code from various sources. The challenge for researchers is to ensure that the data used is abundant and of high quality, as this significantly impacts the model’s ability to handle complex tasks. In code-related applications, well-structured, annotated, and clean data ensures that models can generate accurate, efficient, and reliable results for real-world programming tasks.
A major problem in developing code models is the lack of precise definitions of “high-quality” data. While large amounts of code data are available, much of it contains noise, redundancy, or irrelevant information, which can degrade model performance. Relying on raw data, even after filtering, often leads to inefficiencies. This problem becomes apparent when models trained on large data sets underperform on practical benchmarks. To address this, increased emphasis has been placed on not only acquiring large amounts of data, but also selecting data that aligns well with downstream applications, improving the model’s predictive capabilities and overall utility.
Historically, pretraining code models involved scraping code from large repositories such as GitHub and processing the raw data with basic filtering and deduplication techniques. Researchers then applied random forest classifiers or simple quality filters to identify code with educational value, as seen in models like Phi-1. While these methods improved data quality to some extent, they were not sufficient to achieve optimal performance on more challenging coding tasks. Newer approaches have adopted more sophisticated tools, such as BERT-based annotators, to classify code quality and select the data most likely to contribute to model success.
The research team from Snowflake AI Research, the University of Illinois at Urbana-Champaign, and Seoul National University introduced Arctic-SnowCoder-1.3B, a new approach to pretraining code models that progressively refines data quality across three distinct phases. The method combines general pretraining, continued pretraining with high-quality data, and final pretraining with synthetic data. The researchers leveraged existing datasets such as The Stack v1 and GitHub crawls, along with synthetic data generated by Llama-3.1-70B, to build a smaller, more efficient model. Each phase focused on optimizing the data used, so that the model could outperform its competitors.
In the first phase, Arctic-SnowCoder was trained with 500 billion code tokens derived from raw data sources such as The Stack v1 and GitHub. This data underwent basic preprocessing steps, including filtering and deduplication, resulting in approximately 400 billion unique tokens. During this phase, the model was trained without advanced quality filters, and the data was grouped by programming language and repository. This approach ensured a broad code knowledge base but required further refinement. In the second phase, the research team selected 50 billion tokens from this initial dataset, focusing on high-quality data. A BERT-based quality annotator was employed to score the code files, and the model was further trained on the top-scoring 12.5 billion tokens, repeated four times. This phase significantly improved the quality of the training data, as the annotator was specifically trained to select tokens aligned with the model's downstream applications.
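The phase-two selection step can be illustrated with a minimal sketch (this is not the authors' code; the `quality_score` function below is a toy stand-in for the BERT-based annotator, and the file corpus is invented for demonstration):

```python
# Sketch of phase-2 quality selection: score each code file with a quality
# annotator, keep the top-scoring fraction, and repeat it for several epochs
# (the paper trains on the top tokens four times).

def quality_score(code: str) -> float:
    """Toy stand-in for a learned quality annotator: rewards docstrings
    and comments, and penalizes very short files."""
    score = 0.0
    if '"""' in code or "'''" in code:
        score += 0.5
    score += 0.1 * code.count("#")
    score += min(len(code) / 1000.0, 0.5)
    return score

def select_high_quality(files, keep_fraction=0.25, epochs=4):
    """Rank files by score, keep the top fraction, and repeat the
    selection `epochs` times."""
    ranked = sorted(files, key=quality_score, reverse=True)
    top = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return top * epochs

# Hypothetical mini-corpus: two documented files, two low-quality snippets.
corpus = [
    'def add(a, b):\n    """Add two numbers."""\n    return a + b\n',
    "x=1",
    "# utility helpers\ndef mul(a, b):\n    return a * b\n",
    "tmp",
]
selected = select_high_quality(corpus, keep_fraction=0.25, epochs=4)
print(len(selected))  # 1 file kept, repeated over 4 epochs -> 4
```

In the real pipeline the annotator is a trained BERT classifier and the selection is measured in tokens rather than files, but the shape of the computation is the same: score, rank, keep the top slice, and upweight it by repetition.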
The final phase involved a last round of pretraining on 5 billion synthetic tokens generated by Llama-3.1-70B. These tokens were created using the high-quality data from phase two as seeds, transforming lower-quality data into high-quality synthetic documents. This phase further refined the model's ability to generate accurate code by ensuring that the training data was relevant and representative of real-world coding tasks. The result was a model that had undergone progressively more rigorous training, with each phase contributing to improved performance.
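A minimal sketch of this seed-driven rewriting step follows. These are assumptions about the general shape of such a pipeline, not the released code: the prompt wording is invented, and `generate` is a placeholder for an actual call to a large model such as Llama-3.1-70B.

```python
# Sketch of phase-3 synthetic data generation: a high-quality file from
# phase two serves as a style seed, and a large model is asked to rewrite
# a lower-quality document in that style.

def build_rewrite_prompt(seed_example: str, source_doc: str) -> str:
    """Combine a high-quality seed file and a raw document into a
    rewriting prompt (wording is illustrative, not from the paper)."""
    return (
        "Here is an example of a high-quality, well-documented code file:\n"
        f"{seed_example}\n\n"
        "Rewrite the following code in the same clean, documented style, "
        "preserving its behavior:\n"
        f"{source_doc}\n"
    )

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. Llama-3.1-70B via an inference API)."""
    return "# synthetic document would be produced here\n"

seed = 'def area(r):\n    """Return the area of a circle."""\n    return 3.14159 * r * r\n'
raw = "def a(r):return 3.14*r*r"

prompt = build_rewrite_prompt(seed, raw)
synthetic = generate(prompt)
```

Running this over a large pool of raw files, with seeds drawn from the phase-two selection, would yield the kind of synthetic corpus the article describes.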
The effectiveness of this approach is evident in the results for Arctic-SnowCoder-1.3B. Despite being trained on just 555 billion tokens, it significantly outperformed other similarly sized models, such as Phi-1.5-1.3B and StarCoderBase-3B, which were trained on over 1 trillion tokens. On the BigCodeBench benchmark, which focuses on practical and challenging programming tasks, Arctic-SnowCoder outperformed Phi-1.5-1.3B by 36%. On HumanEval+, it surpassed StarCoder2-3B, which was trained on over 3 trillion tokens, scoring 28.0 against StarCoder2-3B's 27.4. That the model performs so well despite seeing far fewer tokens underscores the importance of data quality over quantity.
In conclusion, Arctic-SnowCoder-1.3B illustrates the critical role of progressively refined high-quality data in the pre-training of code models. By adopting a three-phase approach, the researchers significantly improved model performance compared to larger models trained with many more tokens. This method demonstrates the importance of aligning pre-training data with downstream tasks and provides practical guidelines for future model development. The success of Arctic-SnowCoder is a testament to the value of high-quality data, showing that careful data curation and synthetic data generation can lead to substantial improvements in code generation models.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among the public.