A recent paper titled “Image Captioners Are Scalable Vision Learners Too” presents an intriguing approach called CapPa, which aims to establish image captioning as a competitive pre-training strategy for vision backbones. The paper, authored by a DeepMind research team, highlights the potential for CapPa to rival the impressive performance of Contrastive Language-Image Pretraining (CLIP) while offering simplicity, scalability, and efficiency.
The researchers extensively compared Cap, their image captioning strategy, with the popular CLIP approach. They carefully matched the pretraining compute, model capacity, and training data between the two strategies to ensure a fair evaluation. They found that the Cap vision backbones outperformed the CLIP models on several tasks, including few-shot classification, image captioning, optical character recognition (OCR), and visual question answering (VQA). Furthermore, when transferred to classification tasks with large labeled training sets, the Cap vision backbones achieved performance comparable to that of CLIP, indicating their potential advantage on downstream multimodal tasks.
To further improve the performance of Cap, the researchers introduced the pretraining procedure CapPa, which combines autoregressive prediction (Cap) with parallel prediction (Pa). They used a Vision Transformer (ViT) as the vision encoder, taking advantage of its strong image understanding capabilities. To predict image captions, they used a standard Transformer decoder architecture, incorporating cross-attention to effectively use the ViT-encoded patch sequence during decoding, as sketched below.
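To make that setup concrete, here is a minimal PyTorch sketch of a Cap-style captioner. The class name, hyperparameters, and shapes (CapCaptioner, d_model, max_len, and so on) are illustrative assumptions rather than the authors' code; the only ingredients taken from the paper are a ViT-encoded patch sequence used as cross-attention memory and a standard Transformer decoder that predicts caption tokens autoregressively.

```python
# Hypothetical sketch of a Cap-style captioner (not the authors' implementation).
import torch
import torch.nn as nn

class CapCaptioner(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12,
                 n_layers=6, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens, caption_tokens):
        # image_tokens: (B, N_patches, d_model) patch embeddings from a ViT encoder
        # caption_tokens: (B, T) integer token ids of the caption
        T = caption_tokens.size(1)
        x = self.token_emb(caption_tokens) + self.pos_emb[:T]
        # Causal mask -> standard autoregressive (Cap) training objective
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(tgt=x, memory=image_tokens, tgt_mask=causal_mask)
        return self.lm_head(h)  # (B, T, vocab_size) next-token logits
```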
Instead of training the model solely in an autoregressive fashion, the researchers adopted a parallel prediction approach in which the model predicts all caption tokens independently and simultaneously. Because the decoder no longer sees the preceding text tokens when making each prediction, it must rely heavily on the image information to predict accurately. This strategy pushes the decoder to exploit the rich visual context provided by the image.
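The parallel-prediction (Pa) mode can be sketched on top of the same hypothetical model: every decoder input is replaced by a learned mask embedding and the causal mask is dropped, so all caption tokens are predicted in a single forward pass and the decoder can only draw on the image through cross-attention. This is an assumed illustration of the idea, not the authors' implementation; how the two modes are mixed during training is a design choice described in the paper.

```python
# Sketch of the parallel-prediction (Pa) mode, reusing the CapCaptioner above.
import torch
import torch.nn as nn

def parallel_prediction_logits(captioner, image_tokens, seq_len, mask_emb):
    # captioner: the CapCaptioner sketched above
    # mask_emb: (d_model,) learned [MASK] embedding (hypothetical)
    B = image_tokens.size(0)
    x = mask_emb.expand(B, seq_len, -1) + captioner.pos_emb[:seq_len]
    # No causal mask: positions cannot peek at other text tokens, only the image
    h = captioner.decoder(tgt=x, memory=image_tokens)
    return captioner.lm_head(h)  # (B, seq_len, vocab_size) per-position logits

# Illustrative usage with made-up shapes
model = CapCaptioner()
mask_emb = nn.Parameter(torch.zeros(768))   # learned [MASK] embedding
img = torch.randn(2, 196, 768)              # ViT patch embeddings for 2 images
logits = parallel_prediction_logits(model, img, seq_len=16, mask_emb=mask_emb)
```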
The researchers then assessed the performance of CapPa against conventional Cap and the state-of-the-art CLIP approach on a wide range of downstream tasks, including image classification, captioning, OCR, and VQA. The results were very promising, as CapPa consistently outperformed Cap on almost every task. Moreover, compared to CLIP* trained with the same batch size, CapPa achieved comparable or better performance. In addition, CapPa displayed strong zero-shot capabilities, allowing effective generalization to unseen tasks, and exhibited promising scaling properties, indicating its potential to handle larger-scale datasets and models.
Overall, the work presented in the paper establishes image captioning as a competitive pre-training strategy for vision backbones. By demonstrating the effectiveness of CapPa in achieving high-quality results across various downstream tasks, the research team hopes to inspire further exploration of captioning as a pre-training task for vision encoders. With its simplicity, scalability, and efficiency, CapPa opens up exciting possibilities for advancing vision-based models and pushing the boundaries of multimodal learning.
Niharika is a technical consulting intern at Marktechpost. She is a third-year student currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a very enthusiastic person with a strong interest in machine learning, data science, and artificial intelligence, and an avid reader of the latest developments in these fields.