In recent years, artificial intelligence has seen exceptional advances, with many new models introduced, especially in NLP and computer vision. CLIP is a neural network developed by OpenAI and trained on a massive dataset of image-text pairs. It has advanced computer vision research and underpins many modern recognition systems and generative models. The researchers behind this work believe that CLIP owes much of its effectiveness to the data it was trained on, and that uncovering its data curation process would make it possible to build even more effective models.
In this work, the researchers attempt to make CLIP's data curation approach available to the public by introducing Metadata-Curated Language-Image Pretraining (MetaCLIP). MetaCLIP takes a raw data pool and metadata derived from CLIP's concepts, and produces a balanced subset over the metadata distribution. Applied to CommonCrawl with 400 million image-text pairs, MetaCLIP's data outperforms CLIP's data on multiple benchmarks.
The paper's authors apply the following principles to achieve this goal:
- The researchers first curate a new dataset of 400 million image-text pairs collected from various internet sources.
- Using substring matching, they align image-text pairs with metadata entries, effectively associating unstructured texts with structured metadata.
- Then, all texts associated with each metadata entry are grouped into lists, creating a mapping of each entry to the corresponding texts.
- The associated lists are then subsampled to ensure a more balanced data distribution, making the data better suited for pre-training (see the sketch after this list).
- To formalize the curation process, they introduce an algorithm that aims to improve scalability and reduce space complexity.
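A minimal Python sketch of this matching-and-balancing pipeline is below. The metadata entries, the toy texts, the helper names, and the tiny cap `t = 2` are illustrative assumptions for the example, not the authors' exact implementation (at the 400M scale the paper reportedly uses a far larger per-entry cap).

```python
import random
from collections import defaultdict

# Toy metadata vocabulary. The paper derives its metadata from CLIP's
# concepts (on the order of 500k entries); these few are placeholders.
metadata = ["golden retriever", "dog", "beach", "sunset"]

# Toy pool of raw alt-texts paired with images (image payloads omitted).
texts = [
    "a golden retriever playing on the beach",
    "dog at sunset",
    "photo of a sunset over the ocean",
    "my dog",
    "another dog photo",
]

# Step 1: substring matching -- associate each text with every metadata
# entry that occurs in it verbatim, linking unstructured text to
# structured metadata.
entry_to_texts = defaultdict(list)
for text in texts:
    for entry in metadata:
        if entry in text:
            entry_to_texts[entry].append(text)

# Step 2: balanced subsampling -- cap each entry's list at a threshold t,
# so over-represented (head) entries are downsampled while long-tail
# entries are kept in full. t = 2 is purely for this toy example.
t = 2
curated = []
for entry, matched in entry_to_texts.items():
    if len(matched) > t:
        matched = random.sample(matched, t)
    curated.extend(matched)

# Drop duplicates arising from texts matched by several entries.
curated = list(dict.fromkeys(curated))
print(curated)
```

The released algorithm reportedly applies the sampling at the pair level, so a text matched by several entries is counted once; the list-capping above is a simplified rendering of the same balancing idea.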
MetaCLIP selects data without using the images directly, yet still improves the alignment of visual content by controlling the quality and distribution of the text. The substring matching process makes it more likely that a text mentions the entities shown in the image, which increases the chances of finding the corresponding visual content. Additionally, balancing favors long-tail entries, which may carry more diverse visual content than head entries.
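To make the effect of balancing concrete, here is a small worked example with made-up entry counts; the entries and numbers are hypothetical, and the `t = 20_000` cap only mirrors the value the paper reportedly uses at the 400M scale.

```python
# Made-up match counts for four metadata entries, illustrating a
# long-tailed distribution (not real MetaCLIP statistics).
counts = {"dog": 1_000_000, "cat": 600_000, "axolotl": 120, "theremin": 45}

t = 20_000  # per-entry cap; reportedly t = 20k at the 400M scale

kept = {entry: min(n, t) for entry, n in counts.items()}
total = sum(kept.values())

for entry, k in kept.items():
    share = 100 * k / total
    print(f"{entry}: kept {k:>6} of {counts[entry]:>9} -> {share:.1f}% of curated data")

# Before balancing, "dog" and "cat" account for ~99.99% of the pool;
# after capping, tail entries like "axolotl" keep all their pairs and
# the head entries no longer dominate the distribution.
```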
For the experiments, the researchers used two data pools: one to estimate a target of 400 million image-text pairs, and the other to scale the curation process. As mentioned above, MetaCLIP outperforms CLIP when applied to CommonCrawl with 400 million data points. Furthermore, MetaCLIP outperforms CLIP on zero-shot ImageNet classification across ViT models of various sizes.
On zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy with a ViT-B model versus 68.3% for CLIP, and 76.2% with a ViT-L model versus 75.5% for CLIP. Scaling the training data to 2.5 billion image-text pairs, with the same training budget and a similar distribution, further improves accuracy to 79.2% for ViT-L and 80.5% for ViT-H:

| Model | CLIP (400M) | MetaCLIP (400M) | MetaCLIP (2.5B) |
|-------|-------------|-----------------|-----------------|
| ViT-B | 68.3% | 70.8% | – |
| ViT-L | 75.5% | 76.2% | 79.2% |
| ViT-H | – | – | 80.5% |

These are unprecedented results for zero-shot ImageNet classification.
In conclusion, in an effort to understand OpenAI's CLIP data curation process and replicate its high performance, the authors presented MetaCLIP, whose curated data outperforms CLIP's data on multiple benchmarks. MetaCLIP achieves this by using substring matching to align image-text pairs with metadata entries and by subsampling the associated lists to ensure a more balanced data distribution. This makes MetaCLIP a promising new approach to data curation, with the potential to enable the development of even more effective models.
Check out the Paper and GitHub for more details on this research.