In multimodal learning, large image-text foundation models have demonstrated excellent zero-shot performance and improved robustness across a wide range of downstream tasks. Models such as Contrastive Language-Image Pretraining (CLIP) have driven significant progress in multimodal AI thanks to their ability to reason over images and text jointly. Recently, a range of architectures have shown strong performance on vision tasks on resource-constrained devices; for example, pruning ViT architectures yields smaller and faster CLIP models.
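To make the zero-shot capability concrete, here is a minimal sketch of CLIP zero-shot classification using the Hugging Face transformers library and the openai/clip-vit-base-patch16 checkpoint; the image path and label set are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of CLIP zero-shot classification (illustrative, not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg")  # any local image (assumed path)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, converted to probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs.squeeze().tolist())))
```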
However, models like CLIP rely on large transformer-based encoders with significant memory and latency overhead, which poses challenges for deployment on mobile devices. This article addresses two problems. The first is the cost of navigating the trade-off between runtime performance and accuracy across different architectures: large-scale CLIP training is expensive, which slows the exploration of architectural designs and motivates the reinforced datasets DataCompDR-12M and DataCompDR-1B introduced in this work. The second is the reduced capacity of smaller architectures, which leads to lower accuracy.
Apple researchers introduced MobileCLIP, a new family of image-text models optimized for runtime performance through an efficient training approach, namely multi-modal reinforced training. MobileCLIP establishes a new state of the art in the latency-accuracy trade-off for zero-shot classification and retrieval tasks across multiple datasets. The training approach transfers knowledge from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. This additional knowledge is stored in a reinforced dataset, so the method avoids any extra compute overhead at training time.
The proposed multi-modal reinforced training approach is combined with DataCompDR to address these challenges. Training on DataCompDR achieves higher accuracy than training on the original dataset for a given compute budget. This is made possible by a dataset reinforcement strategy that stores synthetic captions and teacher embeddings directly in the dataset, avoiding additional training time. Its main components are (a) leveraging the knowledge of an image captioning model through synthetic captions and (b) distilling image-text alignment knowledge from an ensemble of pre-trained, strong CLIP models.
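Conceptually, the training objective can be viewed as a standard CLIP contrastive loss combined with a distillation term that matches the student's image-text similarity matrix to one computed from the teacher embeddings stored in the reinforced dataset. The PyTorch sketch below illustrates that idea only; the function names, loss weighting `lam`, and temperature are assumptions for illustration, not the authors' exact formulation:

```python
# Simplified sketch of distillation from stored teacher embeddings (illustrative only).
# All embeddings are assumed to be L2-normalized.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss between image and text embeddings."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def distillation_loss(s_img, s_txt, t_img, t_txt, temperature=0.07):
    """Match the student's image-text similarity matrix to the teacher's."""
    s_logits = s_img @ s_txt.t() / temperature
    t_logits = t_img @ t_txt.t() / temperature
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")

def training_step(student_img, student_txt, teacher_img, teacher_txt, lam=0.7):
    # teacher_img / teacher_txt are loaded from the reinforced dataset,
    # so no teacher forward pass is needed at training time.
    l_clip = clip_contrastive_loss(student_img, student_txt)
    l_dist = distillation_loss(student_img, student_txt, teacher_img, teacher_txt)
    return (1 - lam) * l_clip + lam * l_dist
```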
Three small variants of MobileCLIP are created with a 12-layer transformer base, and the fastest variant, MobileCLIP-S0, is roughly five times faster and three times smaller than the standard ViT-B/16 CLIP model. Additionally, multi-modal reinforced training yields an average performance gain of +2.9% across 38 evaluation benchmarks when training a ViT-B/16 image backbone. To mitigate noisy web-sourced data, DataComp and data filtering networks (DFN) are used to improve dataset quality, and the CoCa captioning model is used to increase visual descriptiveness by generating multiple synthetic captions for each image.
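As a rough illustration of the synthetic-captioning step, the snippet below generates a caption for one image with a pre-trained CoCa model via the open_clip library; the checkpoint tag follows open_clip's documentation and, along with the image path, is an assumption that may differ from the exact setup used to build DataCompDR:

```python
# Sketch: generating a synthetic caption with a pre-trained CoCa model via open_clip.
import torch
from PIL import Image
import open_clip

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
model.eval()

image = transform(Image.open("example.jpg")).unsqueeze(0)  # any local image (assumed path)
with torch.no_grad():
    tokens = model.generate(image)

# Decode and strip special tokens to obtain the caption text.
caption = open_clip.decode(tokens[0])
caption = caption.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
print(caption)
# Repeating generation with sampling-based decoding would produce the multiple
# diverse captions per image described above.
```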
In conclusion, the proposed model, MobileCLIP, is a new family of efficient image-text models optimized for runtime performance through an efficient training approach, i.e., multi-modal reinforced training. The researchers also introduced DataCompDR, a training dataset reinforced with knowledge from a pre-trained image captioning model and an ensemble of strong CLIP models. MobileCLIP models trained on DataCompDR establish a new state of the art in the latency-accuracy trade-off for zero-shot classification and retrieval tasks across multiple datasets.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter.
Don't forget to join our 40k+ ML SubReddit.
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.