Meta and MLCommons Researchers Propose DataPerf: The First Platform for Building Data-Centric AI Algorithm Leaderboards

The rise of Machine Learning (ML) has created new challenges related to the availability and efficiency of data sets for training and testing ML models. This problem, commonly known as the “data bottleneck,” is hindering the progress and deployment of ML models in various fields. In response, a platform and community called DataPerf has been developed to create competitions and leaderboards for data and data-centric AI algorithms.

One of the main problems with data sets is their quality. Public training and testing data sets are typically created from readily available sources such as web scrapes, forums, and Wikipedia, or through crowdsourcing. However, these sources often suffer from problems such as bias, poor distribution, and poor quality. For example, visual data is often skewed towards wealthier regions, which biases the resulting models. These quality problems then lead to quantity problems: when a large part of the data is of low quality, more data is needed, which increases the size and computational cost of the models. As public data sources become depleted, ML models can even stagnate in terms of accuracy, slowing down progress. Therefore, improving the quality of training and test data is crucial for moving the AI community forward.

DataPerf seeks to address these challenges by providing a platform for the development of leaderboards for data and data-centric AI algorithms. The platform is inspired by ML leaderboards and aims to have a similar impact on data-centric AI research as ML leaderboards had on ML modeling research. It is built on Dynabench, a tool for benchmarking data, data-centric algorithms, and models.

DataPerf version 0.5 currently offers five challenges that focus on five common data-centric tasks across four different application domains. These challenges aim to compare and improve the performance of data-centric algorithms and models. Each challenge comes with design documents that describe the problem, model, quality objective, rules, and submission guidelines. The Dynabench platform includes a live leaderboard, an online assessment framework, and tracking of submissions over time.

The first two challenges focus on training data selection, where participants devise a strategy to select the best training set from a large pool of candidates: noisily labeled training images in one challenge and automatically extracted clips of spoken words in the other. The third challenge focuses on training data cleaning, where participants devise a strategy for choosing which samples to relabel from a noisy training set, with the current version targeting image classification. The fourth challenge focuses on training dataset valuation, where participants devise a strategy to select the best training set from multiple data sellers based on the limited information exchanged between buyers and sellers. Finally, the fifth challenge, called Adversarial Nibbler, focuses on designing safe-looking prompts that lead to the generation of unsafe images in the text-to-image multimodal domain.
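To make the data-cleaning task more concrete, here is a minimal sketch of one possible strategy for choosing which samples to relabel under a fixed budget. It is not DataPerf's baseline or scoring method. The function name `rank_for_relabeling`, the nearest-neighbour disagreement score, and the toy data are all assumptions made for illustration.

```python
from math import dist


def rank_for_relabeling(features, noisy_labels, k=3, budget=2):
    """Return indices of the `budget` samples whose labels look most suspicious.

    Hypothetical heuristic: a sample's suspicion score is the fraction of its
    k nearest neighbours (Euclidean distance) that carry a different label.
    """
    scores = []
    for i, (x, y) in enumerate(zip(features, noisy_labels)):
        # Find the k other samples closest to sample i.
        neighbours = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: dist(x, features[j]),
        )[:k]
        disagree = sum(noisy_labels[j] != y for j in neighbours)
        scores.append((disagree / k, i))
    # Most suspicious first; ties are broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:budget]]


# Toy data (assumed): two tight clusters, each with one flipped label.
features = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # cluster near the origin
    (10, 10), (10, 11), (11, 10), (11, 11),  # cluster near (10, 10)
]
noisy_labels = ["cat", "cat", "cat", "dog", "dog", "dog", "cat", "dog"]

suspects = rank_for_relabeling(features, noisy_labels, k=3, budget=2)
print(suspects)  # indices of the samples to send for relabeling
```

In this toy setup, the two samples whose labels disagree with all of their neighbours are ranked first. A real submission would work on image embeddings and be judged by the accuracy of a model trained on the cleaned set, as described in the challenge's design documents.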

DataPerf provides a platform and community to develop competitions and leaderboards for data and data-centric AI algorithms. By addressing the data bottleneck through benchmarking and improving the quality of training and test data, DataPerf aims to make machine learning better for the future. The challenges that DataPerf offers are also intended to foster innovation and encourage new approaches to address the data bottleneck challenge in machine learning. Ultimately, DataPerf’s efforts could help overcome the limitations of existing data sets and enable the development of more accurate and reliable machine learning models across multiple domains.

All credit for this research goes to the researchers of this project.

Niharika is a technical consulting intern at Marktechpost. She is a third year student, currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a very enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence and an avid reader of the latest developments in these fields.
