Unimodal models are designed to work with data in a single modality, such as text or images. These models specialize in understanding and generating content in that one modality. For example, GPT models are great at generating human-like text and have been used for tasks such as language translation, text generation, and question answering. Convolutional neural networks (CNNs) are examples of image models that excel at tasks such as image classification, object detection, and image generation. However, many interesting tasks, such as visual question answering (VQA) and image-text retrieval, require multimodal capabilities. Is it possible to combine text and image processing? Yes, it is! CLIP stands out as one of the first highly successful image-text models, demonstrating proficiency in both image recognition and text understanding.
We will divide this article into the following sections:
- Introduction
- Architecture
- Training process and contrastive loss
- Zero-shot capability
- CuPL
- Conclusions
The CLIP model is an impressive zero-shot predictor, able to make predictions on tasks for which it has not been explicitly trained. As we will see in more detail in the following sections, by using natural language prompts to query images, CLIP can perform image classification without requiring task-specific training data. However, its performance can be significantly improved with a few tricks. In this series of articles, we will explore methods that leverage additional cues generated by large language models (LLMs), or a few training examples, without tuning any model parameters. These approaches have a clear advantage: they are computationally less demanding and do not require training additional parameters.
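To make the zero-shot idea concrete, here is a minimal sketch of prompt-based classification with a pretrained CLIP model through the Hugging Face `transformers` library. The checkpoint name, the image path, and the candidate labels are placeholders chosen for illustration, not part of the original article.

```python
# Zero-shot image classification with CLIP: no task-specific training,
# just natural-language candidate labels scored against the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical image to classify
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image contains the image-text similarity scores for each label;
# a softmax turns them into a probability distribution over the prompts.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

The key point is that the "classifier" is defined entirely by the text prompts: swapping in a different set of labels changes the task without any retraining.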
CLIP is a dual-encoder model with two separate encoders for the visual and textual modalities, which encode images and text independently. This architecture differs from a fusion encoder, which allows interaction between the visual and textual modalities through cross-attention, i.e., learning attention weights that help the model focus on specific regions of…