We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pre-training recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art open-data language model, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
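
To make the model-based filtering step more concrete, below is a minimal sketch of quality filtering with a fastText-style classifier, assuming documents are scored individually and kept above a probability cutoff. The model path, label name, and threshold are illustrative placeholders, not values from DCLM, and the actual pipeline may instead keep a top fraction of documents by score.

```python
# Illustrative sketch of model-based quality filtering, not the exact DCLM pipeline.
# Assumes a pretrained fastText binary quality classifier is available on disk;
# the path, label, and threshold below are hypothetical.
import fasttext

def filter_documents(docs, model_path="quality_classifier.bin",
                     positive_label="__label__high_quality",
                     threshold=0.9):
    """Keep documents whose predicted high-quality probability exceeds the threshold."""
    model = fasttext.load_model(model_path)
    kept = []
    for doc in docs:
        # fastText expects single-line input, so collapse newlines before scoring.
        labels, probs = model.predict(doc.replace("\n", " "), k=1)
        if labels[0] == positive_label and probs[0] >= threshold:
            kept.append(doc)
    return kept
```

In practice this kind of filter is applied in a streaming or sharded fashion over the raw Common Crawl text rather than over an in-memory list, but the selection logic is the same.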