We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pre-training recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art open-data language model, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
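
To make the model-based filtering step more concrete, below is a minimal sketch of quality filtering with a fastText-style classifier, assuming documents are scored individually and kept above a probability cutoff. The model path, label name, and threshold are illustrative placeholders, not values from DCLM, and the actual pipeline may instead keep a top fraction of documents by score.

```python
# Illustrative sketch of model-based quality filtering, not the exact DCLM pipeline.
# Assumes a pretrained fastText binary quality classifier is available on disk;
# the path, label, and threshold below are hypothetical.
import fasttext

def filter_documents(docs, model_path="quality_classifier.bin",
                     positive_label="__label__high_quality",
                     threshold=0.9):
    """Keep documents whose predicted high-quality probability exceeds the threshold."""
    model = fasttext.load_model(model_path)
    kept = []
    for doc in docs:
        # fastText expects single-line input, so collapse newlines before scoring.
        labels, probs = model.predict(doc.replace("\n", " "), k=1)
        if labels[0] == positive_label and probs[0] >= threshold:
            kept.append(doc)
    return kept
```

In practice this kind of filter is applied in a streaming or sharded fashion over the raw Common Crawl text rather than over an in-memory list, but the selection logic is the same.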