Recently, there has been a surge of interest in image and language representation learning, with the goal of capturing the intricate relationship between visual and textual information. Contrastive Language-Image Pretraining (CLIP) has emerged as a particularly promising approach, demonstrating state-of-the-art performance on a range of tasks and robustness to out-of-distribution data. While previous studies have focused on scaling CLIP up with extensive computational resources, this research investigates its performance under resource constraints, scaling CLIP down along three axes: data, architecture, and training strategies. The study, conducted on roughly 3.4 billion English image-text pairs from the WebLI dataset, fixes compute budgets and evaluates different pre-training strategies under them.
CLIP is a joint pre-training framework for image and text representations that uses a contrastive loss to learn a shared embedding space, and it achieves remarkable zero-shot performance on visual classification tasks. Extensions such as LiT and SLIP aim to improve CLIP's training efficiency. Efforts to scale CLIP, including FLIP and related methods, likewise target efficiency and scalability, although their focus remains on large computational budgets.
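To make the objective concrete, below is a minimal sketch of the symmetric contrastive (InfoNCE) loss that CLIP-style models optimize, written in PyTorch; the fixed temperature value and tensor shapes are illustrative assumptions rather than the paper's exact configuration (CLIP learns the temperature during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the vision and text encoders.
    temperature: a fixed illustrative value here; CLIP learns it during training.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs sit on the diagonal of the logits matrix.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```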
Researchers from the University of California and Google DeepMind present a study of CLIP performance under constrained compute budgets, exploring three key dimensions: data, architecture, and training strategies. The work underlines the importance of high-quality training data, revealing that a smaller, high-quality dataset can outperform a larger, lower-quality one. The researchers also investigate how model performance varies with dataset size, finding that smaller Vision Transformer (ViT) models are better suited to smaller datasets, whereas larger models pay off on larger datasets at a fixed compute budget. The study also provides guidance on choosing between CNN-based and ViT-based vision encoders for CLIP training.
The training process mirrors CLIP's approach, employing a contrastive loss to train the text and vision encoders so that corresponding image-text pairs receive similar representations. The WebLI dataset, comprising more than 10 billion multilingual image-text pairs, forms the experimental basis; the study focuses on its roughly 3.4 billion English pairs. Text is processed with a SentencePiece tokenizer with a vocabulary size of 32k. Evaluation covers zero-shot transfer, linear probing, and retrieval on MSCOCO captions, adhering to established protocols so that comparisons of model generalization and effectiveness remain fair.
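The zero-shot transfer metric mentioned above works by ranking text prompts built from class names against each image embedding. The sketch below illustrates the idea; the `encode_image`/`encode_text` methods and the `tokenizer` callable follow common open-source CLIP interfaces and are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names,
                       template="a photo of a {}"):
    """Zero-shot transfer: score images against prompts built from class names.

    `model.encode_image` / `model.encode_text` and `tokenizer` mirror the
    interface of common open-source CLIP implementations (an assumption,
    not the paper's exact code).
    """
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    image_emb = F.normalize(model.encode_image(images), dim=-1)
    # Assign each image the class whose prompt embedding is most similar.
    return (image_emb @ text_emb.t()).argmax(dim=-1)
```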
In linear probing, MLP-Mixer outperforms the other architectures when fewer samples are available, but ViT-B/32 excels as the sample size increases, especially on out-of-distribution (OOD) variants. ViT is preferred for its robustness and standard accuracy at larger sample sizes, while ResNet is the better fit for smaller ones. ViT and MLP-Mixer also show stronger robustness and generalization on out-of-distribution datasets, which the authors attribute to their lower inductive bias.
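Linear probing, as used in these comparisons, freezes the pretrained vision encoder and fits a linear classifier on its features. A minimal sketch of that recipe follows; the logistic-regression settings and the `encoder`/data-loader objects are placeholders, not the paper's exact protocol.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen vision encoder over a labeled dataset and collect features."""
    feats, labels = [], []
    encoder.eval()
    for images, targets in loader:
        feats.append(encoder(images.to(device)).cpu())
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def linear_probe(encoder, train_loader, test_loader):
    """Fit a logistic-regression probe on frozen features and report test accuracy."""
    x_train, y_train = extract_features(encoder, train_loader)
    x_test, y_test = extract_features(encoder, test_loader)
    probe = LogisticRegression(max_iter=1000, C=1.0)  # placeholder hyperparameters
    probe.fit(x_train, y_train)
    return probe.score(x_test, y_test)  # top-1 accuracy of the linear probe
```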
On retrieval tasks, ResNet-50 performs better at smaller sample sizes, but ViT-B/32 overtakes it once the sample size exceeds 400 million for both few-shot and retrieval tasks. Mixer-B/32 consistently exhibits the worst retrieval performance. These findings indicate that ViT is the preferred vision encoder across zero-shot, linear probing, few-shot, and retrieval tasks.
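Retrieval performance of this kind is typically reported as recall@k over paired image and caption embeddings. The sketch below assumes one matching caption per image for simplicity (MSCOCO provides several captions per image, so the real protocol maps each image to a set of caption indices); it is an illustration, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieval_recall_at_k(image_emb, text_emb, k=5):
    """Recall@k for image->text retrieval on paired embeddings.

    Assumes row i of `image_emb` and `text_emb` form a matching pair
    (a simplification of MSCOCO-style caption retrieval).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()               # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices           # k most similar captions per image
    targets = torch.arange(image_emb.size(0), device=image_emb.device).unsqueeze(1)
    hits = (topk == targets).any(dim=-1).float()  # true caption found in top-k?
    return hits.mean().item()
```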
In conclusion, the paper investigates the influence of data size, network architecture, and training strategies on CLIP performance. It underscores the importance of both data quantity and quality, and shows how data augmentation can boost CLIP performance without imposing substantial computational cost. The study also compares network architectures and training strategies, revealing that different choices excel under different computational budgets and emphasizing the need for careful selection to get the most out of CLIP.
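As an illustration of the kind of lightweight image augmentation the study refers to, a typical torchvision pipeline is sketched below; the specific transforms and parameters are common defaults and an assumption on our part, not the augmentations reported in the paper.

```python
from torchvision import transforms

# Illustrative augmentation pipeline; the transforms and their parameters are
# common defaults, not the paper's reported configuration.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop + resize
    transforms.RandomHorizontalFlip(),                    # cheap geometric augmentation
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    # Normalization constants commonly used with CLIP vision encoders.
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```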
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.