Artificial intelligence is finding applications in nearly every sphere of our lives, but its effectiveness largely depends on the data available for training.
Conventionally, training accurate AI models has been linked to the availability of substantial amounts of data, in part because the search spaces involved are enormous. The Open Catalyst Project, for example, uses more than 200 million data points describing potential catalyst materials.
The computing resources required to analyze such data sets and develop models on them are a major obstacle: the Open Catalyst data sets required around 16,000 GPU-days to analyze and build models on. Training budgets of that size are available to only a few researchers, so model development is often limited to smaller data sets or a fraction of the available data.
A study by engineering researchers at the University of Toronto, published in Nature Communications, suggests that the belief that deep learning models require large amounts of training data may not always be true.
The researchers argue that we need ways to identify smaller data sets that can still be used to train good models. Dr. Kangming Li, a postdoctoral researcher in Hattrick-Simpers' group, used the example of a model that predicts students' final exam scores: it works best on the Canadian student data set it was trained on, but it may fail to predict grades for students from other countries.
One possible solution is to identify subsets within these very large data sets. Such subsets should retain the diversity and information of the original data set while being far easier to handle during processing.
Li developed methods for locating high-quality subsets of publicly available materials data sets such as JARVIS, the Materials Project, and the Open Quantum Materials Database (OQMD). The goal was to better understand how the properties of a data set affect the models trained on it.
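To give a flavour of what such subset selection can look like, here is a minimal sketch of one generic strategy: cluster the featurized materials and keep one representative point per cluster, so the subset spans the same regions of feature space as the full set. This is not the pruning procedure used in the study; the array sizes, features, and clustering approach below are assumptions for illustration only.

```python
# Hypothetical sketch: pick a representative ~5% subset of a large tabular
# data set by clustering the feature vectors and keeping the point closest
# to each cluster centre. Sizes and features are made up for illustration;
# this is not the pruning method used in the study.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))        # stand-in for featurized materials

n_keep = 1_000                           # keep ~5% of the rows
km = MiniBatchKMeans(n_clusters=n_keep, random_state=0).fit(X)

subset_idx = []
for c in range(n_keep):
    members = np.where(km.labels_ == c)[0]
    if members.size == 0:                # MiniBatchKMeans can leave empty clusters
        continue
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    subset_idx.append(members[np.argmin(dists)])

X_subset = X[np.array(subset_idx)]
print(X_subset.shape)                    # roughly (1000, 20)
```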
He trained models both on each original data set and on a much smaller subset containing 95% fewer data points. The model trained on 5% of the data performed comparably to the model trained on the entire data set when predicting material properties within the data set's domain, suggesting that up to 95% of the data can be excluded from training with little or no effect on the accuracy of in-distribution predictions. Overrepresented classes of materials are the main source of this redundancy.
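A minimal sketch of that kind of comparison, using synthetic data and a plain random 5% subsample rather than the actual materials data sets or the study's pruning procedure; the exact scores depend entirely on how redundant the data happens to be, so only the comparison pattern matters here.

```python
# Train the same regressor on 100% and on a random 5% of the training rows
# and compare held-out R^2. Synthetic data only; numbers are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=30, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
idx = rng.choice(len(X_tr), size=int(0.05 * len(X_tr)), replace=False)

full  = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
small = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr[idx], y_tr[idx])

print(f"R2 trained on 100% of the data: {full.score(X_te, y_te):.3f}")
print(f"R2 trained on   5% of the data: {small.score(X_te, y_te):.3f}")
```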
According to Li, the study's findings provide a way to assess how redundant a data set is: if adding more data does not improve model performance, the additional data is redundant and gives the model no new information to learn from.
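In that spirit, redundancy can be probed with a simple learning-curve check: train on growing fractions of the data and see where held-out performance stops improving. The sketch below uses synthetic data and a linear model as stand-ins, not the study's own diagnostic.

```python
# Rough redundancy check: if test performance plateaus well before the full
# training set is used, the remaining rows add little new information.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=20_000, n_features=30, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

order = np.random.default_rng(1).permutation(len(X_tr))  # fixed random order

for frac in (0.01, 0.05, 0.25, 1.0):
    n = int(frac * len(X_tr))
    model = Ridge().fit(X_tr[order[:n]], y_tr[order[:n]])
    print(f"{frac:>4.0%} of training data ({n:>6d} rows): "
          f"R2 = {model.score(X_te, y_te):.3f}")
# On this easy synthetic problem the curve flattens almost immediately,
# i.e. most of the data is redundant for a linear model.
```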
The study adds to a growing body of evidence among AI researchers across multiple domains: models trained on relatively small data sets can perform well, as long as the data quality is high.
In conclusion, the richness of the information matters more than the sheer volume of data: quality should be prioritized over the collection of enormous amounts of data.
Review the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and email newsletter, where we share the latest AI research news, interesting AI projects, and more.
If you like our work, you’ll love our newsletter.
Rachit Ranjan is a consulting intern at MarktechPost. He is currently pursuing his B.Tech at the Indian Institute of Technology (IIT) Patna. He is actively shaping his career in artificial intelligence and data science and is passionate about exploring these fields.