Machine learning (ML) offers enormous potential, from diagnosing cancer to engineering safe autonomous cars to increasing human productivity. However, to realize this potential, organizations need ML solutions that are reliable and ML solution development that is predictable and manageable. The key to both is a deeper understanding of ML data: how to design training data sets that produce high-quality models, and test data sets that accurately indicate how close we are to solving the target problem.
The process of creating high-quality data sets is complicated and error-prone, from the initial selection and cleaning of the raw data to labeling it and splitting it into training and test sets. Some experts believe that most of the effort in designing an ML system actually goes into sourcing and preparing data. Each step can introduce problems and biases. It has even been shown that many of the standard data sets we use today contain mislabeled data that can destabilize established ML benchmarks. Despite the fundamental importance of data to ML, it is only now beginning to receive the level of attention that models and learning algorithms have enjoyed for the past decade.
Toward this goal, we are introducing DataPerf, a set of new data-centric ML challenges to advance the state of the art in data selection, preparation, and acquisition technologies, designed and built through extensive collaboration between industry and academia. The initial release of DataPerf consists of four challenges covering three common data-centric tasks across three application domains: vision, speech, and natural language processing (NLP). In this blog post, we describe the dataset development bottlenecks that researchers face and discuss the role of benchmarks and leaderboards in incentivizing researchers to address them. We invite innovators from academia and industry who want to measure and validate advances in data-centric machine learning to demonstrate the power of their algorithms and techniques to create and improve data sets through these benchmarks.
Data is the new bottleneck for ML
Data is the new code: it is the training data that determines the highest possible quality of an ML solution. The model only determines the degree to which that highest quality is realized; in a sense, the model is a lossy compiler for the data. Although high-quality training data sets are vital to continued advancement in the field of ML, much of the data on which the field relies today is nearly a decade old (for example, ImageNet or LibriSpeech) or pulled from the web with very limited content filtering (e.g., LAION or The Pile).
Despite the importance of data, ML research to date has been dominated by a focus on models. Before modern deep neural networks (DNNs), ML models were not good enough to match human behavior on many simple tasks. This starting point led to a model-centric paradigm in which (1) the training data set and the test data set were “frozen” artifacts and the goal was to develop a better model, and (2) the test data set was randomly drawn from the same pool of data as the training set for statistical reasons. Unfortunately, freezing the data sets ignored the possibility of improving training accuracy and efficiency with better data, and using test sets drawn from the same pool as the training data conflated fitting that data with solving the underlying problem.
Because we are now developing and deploying ML solutions for increasingly sophisticated tasks, we need to design test sets that fully capture real-world problems and training sets that, in combination with advanced models, yield effective solutions. We must shift from today’s model-centric paradigm to a data-centric paradigm in which we recognize that, for most ML developers, creating high-quality training and test data will be the bottleneck.
Shifting from the current model-centric paradigm to a data-centric paradigm enabled by quality datasets and data-centric algorithms like those measured in DataPerf.
Enabling ML developers to create better training and test data sets will require a deeper understanding of ML data quality and the development of algorithms, tools, and methodologies to optimize it. We can start by recognizing common challenges in creating data sets and developing performance metrics for algorithms that address those challenges. For example:
- Data selection: We often have more data available than we can effectively label or train on. How do we choose the most important data to train our models? (A minimal sketch of one such strategy follows this list.)
- Data cleaning: Human labelers sometimes make mistakes, and ML developers can’t afford to have every label reviewed and corrected by experts. How can we select the data most likely to be mislabeled for correction?
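To make the selection task concrete, here is a minimal sketch of one possible strategy, written in Python with purely illustrative names (`score_example`, `select_training_set`) and a toy prototype-distance heuristic; it is an assumption-laden example, not part of any DataPerf API:

```python
# A minimal sketch of a data-selection strategy: rank a large candidate pool
# by a per-example score and keep only the top-k examples for training.
# All names and the scoring heuristic are illustrative, not a DataPerf API.

import numpy as np

def score_example(features: np.ndarray, weak_label: int, prototypes: dict) -> float:
    """Toy heuristic: negative distance of an example to the mean (prototype)
    of its weakly assigned class. Closer examples score higher."""
    return -float(np.linalg.norm(features - prototypes[weak_label]))

def select_training_set(features: np.ndarray, weak_labels: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring candidates."""
    prototypes = {c: features[weak_labels == c].mean(axis=0) for c in set(weak_labels)}
    scores = np.array([score_example(x, y, prototypes)
                       for x, y in zip(features, weak_labels)])
    return np.argsort(scores)[-k:]

# Toy usage: 1,000 candidates with 16-dimensional features and weak labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16))
weak_labels = rng.integers(0, 3, size=1000)
chosen = select_training_set(features, weak_labels, k=200)
print(f"Selected {len(chosen)} of {len(features)} candidates for training.")
```

In practice, the interesting work is in the scoring function itself; the benchmarks measure how much a given strategy improves the model trained on the selected subset.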
We can also create incentives that reward good dataset engineering. We anticipate that high-quality training data, carefully selected and labeled, will become a valuable commodity in many industries, but we currently lack a way to assess the relative value of different data sets without actually training on the data sets in question. How do we solve this problem and enable quality-driven “data acquisition”?
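One way such an assessment might work without full training runs, sketched below under the assumption that a cheap proxy model and a small preview sample are available, is to score each candidate data set by the proxy's accuracy on a trusted validation set; the names and the proxy approach are illustrative, not the DataPerf evaluation protocol:

```python
# A minimal sketch of quality-driven "data acquisition": before "buying" a
# full data set, estimate its relative value by training a cheap proxy model
# on a small preview sample and scoring it on a trusted validation set.

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_dataset_value(preview_X, preview_y, val_X, val_y) -> float:
    """Train a small linear proxy on the preview and report validation accuracy."""
    proxy = LogisticRegression(max_iter=1000)
    proxy.fit(preview_X, preview_y)
    return proxy.score(val_X, val_y)

# Toy usage: compare preview samples offered by two hypothetical sellers.
rng = np.random.default_rng(1)
val_X, val_y = rng.normal(size=(200, 8)), rng.integers(0, 2, size=200)
candidates = {
    "seller_a": (rng.normal(size=(100, 8)), rng.integers(0, 2, size=100)),
    "seller_b": (rng.normal(size=(100, 8)), rng.integers(0, 2, size=100)),
}
for name, (X, y) in candidates.items():
    print(name, estimate_dataset_value(X, y, val_X, val_y))
```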
DataPerf: the first leaderboard for data
We believe that good benchmarks and leaderboards can drive rapid progress in data-centric technology. ML benchmarks in academia have been essential in stimulating progress in the field. Consider the graph below showing progress on popular ML benchmarks (MNIST, ImageNet, SQuAD, GLUE, Switchboard) over time:
Performance over time for popular benchmarks, normalized with starting performance at minus one and human performance at zero. (Source: Douwe et al., 2021; used with permission.)
Online leaderboards provide official validation of benchmark results and catalyze communities intent on optimizing those benchmarks. For example, Kaggle has over 10 million registered users. The official MLPerf benchmark results have helped fuel a 16x improvement in training performance on key benchmarks.
DataPerf is the first community and platform to build leaderboards for data benchmarks, and we hope to have a similar impact on data-centric ML research and development. The initial version of DataPerf consists of leaderboards for four challenges covering three data-centric tasks (data selection, data cleaning, and data acquisition) across three application domains (vision, speech, and NLP):
- Training data selection (Vision): Design a data selection strategy that chooses the best training set from a large candidate pool of weakly labeled training images.
- Training data selection (Speech): Design a data selection strategy that chooses the best training set from a large pool of automatically extracted candidate speech clips.
- Training data cleaning (Vision): Design a data cleaning strategy that chooses samples to relabel from a “noisy” training set where some of the labels are incorrect. (A brief sketch of one such strategy appears after this list.)
- Evaluation of training data sets (NLP): Quality data sets can be expensive to build and are becoming valuable products. Design a data acquisition strategy that chooses which training dataset to “buy” based on limited information about the data.
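As an illustration of the cleaning task above, the following sketch flags the training examples whose given labels a cross-validated model disagrees with most strongly; this is one common heuristic, offered as an assumption-laden example rather than the method the benchmark prescribes:

```python
# A minimal sketch of a data-cleaning strategy: flag for relabeling the
# training examples whose given ("noisy") labels receive the lowest
# out-of-fold predicted probability. Purely illustrative, not a DataPerf API.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_for_relabeling(X, noisy_y, budget: int) -> np.ndarray:
    """Return indices of the `budget` examples most likely to be mislabeled."""
    # Out-of-fold probabilities: each example is scored by a model that never
    # saw it during training.
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, noisy_y,
                              cv=5, method="predict_proba")
    # Low probability assigned to the given label => likely mislabeled.
    given_label_prob = probs[np.arange(len(noisy_y)), noisy_y]
    return np.argsort(given_label_prob)[:budget]

# Toy usage: two Gaussian blobs with 10% of the labels flipped.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, size=(200, 5)), rng.normal(2, 1, size=(200, 5))])
y = np.array([0] * 200 + [1] * 200)
flip = rng.choice(400, size=40, replace=False)
noisy_y = y.copy()
noisy_y[flip] = 1 - noisy_y[flip]
suspects = flag_for_relabeling(X, noisy_y, budget=40)
print("Fraction of flagged examples that were actually flipped:",
      np.isin(suspects, flip).mean())
```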
For each challenge, the DataPerf website provides design documents that define the problem, the test setup, the quality target, and the rules, along with guidelines on how to run the code and make a submission. Live leaderboards are hosted on the Dynabench platform, which also provides an online evaluation framework and submission tracker. Dynabench is an open-source project, hosted by the MLCommons Association, focused on enabling data-centric leaderboards for training and test data and for data-centric algorithms.
How to take part
We are part of a community of ML researchers, data scientists, and engineers striving to improve data quality. We invite innovators from academia and industry to measure and validate data-centric algorithms and techniques to create and improve data sets through DataPerf benchmarks. The deadline for the first round of challenges is May 26, 2023.
Acknowledgments
The DataPerf benchmarks were created over the past year by engineers and scientists from Coactive.ai, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. Furthermore, this would not have been possible without the support of the DataPerf working group members at Carnegie Mellon University, Digital Prism Advisors, Factored, Hugging Face, the Institute for Human and Machine Cognition, Landing.ai, the San Diego Supercomputing Center, Thomson Reuters Lab, and TU Eindhoven.