*Equal contribution
Multimodal datasets are a critical component in recent advances such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered on a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
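To make the filtering workflow concrete, the sketch below shows one simple kind of baseline a participant might submit: keeping only the candidate pairs whose image and text embeddings have the highest cosine similarity. It is a minimal illustration, not the DataComp codebase; the function name, array names, embedding dimension, and keep fraction are all assumptions for the example, and it presumes the embeddings were precomputed and L2-normalized.

```python
# Minimal sketch of a similarity-score filtering baseline (illustrative only).
# Assumes precomputed, L2-normalized image and text embeddings for each
# candidate image-text pair; names and the keep fraction are hypothetical.
import numpy as np

def filter_by_similarity(image_embs: np.ndarray,
                         text_embs: np.ndarray,
                         keep_fraction: float = 0.3) -> np.ndarray:
    """Return indices of pairs whose image-text cosine similarity falls in the
    top `keep_fraction` of the candidate pool."""
    # For normalized embeddings, cosine similarity is a row-wise dot product.
    scores = np.einsum("ij,ij->i", image_embs, text_embs)
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return np.nonzero(scores >= cutoff)[0]

# Toy usage: keep the top 30% of 1,000 random candidate pairs.
rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(1000, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
kept = filter_by_similarity(img, txt)
print(f"kept {len(kept)} of 1000 pairs")
```

In the benchmark, the indices returned by such a filter would define the training subset that is then passed to the standardized CLIP training and evaluation pipeline.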