Pretraining visual language (VL) models on web-scale image-caption datasets has recently emerged as a powerful alternative to traditional pretraining on image classification data. Image-caption datasets are considered more "open-domain" because they contain broader scene types and vocabulary words, and models trained on them show strong performance on few- and zero-shot recognition tasks. However, images with fine-grained class descriptions can be rare, and the class distribution can be imbalanced, since image-caption datasets do not go through manual curation. In contrast, large-scale classification datasets, such as ImageNet, are often curated and can therefore provide fine-grained categories with a balanced label distribution. While it may sound promising, directly combining caption and classification datasets for pretraining is often unsuccessful, as it can result in biased representations that do not generalize well to various downstream tasks.
In "Prefix Conditioning Unifies Language and Label Supervision", presented at CVPR 2023, we demonstrate a pretraining strategy that uses both classification and caption datasets to provide complementary benefits. First, we show that naively unifying the datasets results in sub-optimal performance on downstream zero-shot recognition tasks, because the model is affected by dataset bias: the coverage of image domains and vocabulary words differs in each dataset. We address this problem during training through prefix conditioning, a novel, simple, and effective method that uses prefix tokens to disentangle dataset biases from visual concepts. This approach allows the language encoder to learn from both datasets while also tailoring feature extraction to each dataset. Prefix conditioning is a generic method that can be easily integrated into existing VL pretraining objectives, such as Contrastive Language-Image Pretraining (CLIP) or Unified Contrastive Learning (UniCL).
High-level idea
We note that classification datasets tend to be biased in at least two ways: (1) the images mostly contain single objects from restricted domains, and (2) the vocabulary is limited and lacks the linguistic flexibility required for zero-shot learning. For example, the class embedding of "a photo of a dog" optimized for ImageNet typically results in a photo of one dog in the center of the image pulled from the ImageNet dataset, which does not generalize well to other datasets containing images of multiple dogs in different spatial locations or a dog together with other subjects.
By contrast, caption datasets contain a wider variety of scene types and vocabularies. As shown below, if a model simply learns from the two datasets, the language embeddings can entangle the biases of the image classification and caption datasets, which can decrease generalization in zero-shot classification. If we can disentangle the biases of the two datasets, we can instead use language embeddings that are tailored to the caption dataset to improve generalization.
Top: Language embeddings entangling the biases of the image classification and caption datasets. Bottom: Language embeddings disentangling the biases of the two datasets.
Prefix conditioning
Prefix conditioning is partially inspired by prompt tuning, which prepends learnable tokens to the input token sequence to instruct a pretrained model backbone to learn task-specific knowledge that can be used to solve downstream tasks. The prefix conditioning approach differs from prompt tuning in two ways: (1) it is designed to unify image-caption and classification datasets by disentangling dataset bias, and (2) it is applied during VL pretraining, while standard prompt tuning is used to fine-tune models. Prefix conditioning is an explicit way to steer the behavior of model backbones based on the type of dataset provided by users. This is especially helpful in production when the number of different dataset types is known ahead of time.
During training, prefix conditioning learns one text token (a prefix token) per dataset type, which absorbs the bias of that dataset and allows the remaining text tokens to focus on learning visual concepts. Specifically, it prepends the prefix token for each dataset type to the input tokens, informing the language and visual encoders of the type of input data (e.g., classification vs. caption). The prefix tokens are trained to learn the dataset-type-specific bias, which allows us to disentangle that bias in the language representations and utilize the embedding learned on the image-caption dataset at test time, even without an input caption.
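For concreteness, here is a minimal sketch of how a learnable per-dataset-type prefix embedding could be prepended to the text token embeddings before the text encoder. The class and argument names (`PrefixConditionedTextEncoder`, `dataset_type_id`, `text_transformer`) are illustrative assumptions, not names from the paper, and the actual architecture details differ.

```python
import torch
import torch.nn as nn

class PrefixConditionedTextEncoder(nn.Module):
    """Sketch: one learnable prefix token per dataset type is prepended to the
    token embeddings before the text Transformer. Dimensions are illustrative."""

    def __init__(self, text_transformer, vocab_size, embed_dim, num_dataset_types=2):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # One prefix embedding per dataset type, e.g., 0 = classification, 1 = caption.
        self.prefix_embed = nn.Embedding(num_dataset_types, embed_dim)
        self.text_transformer = text_transformer  # any sequence encoder

    def forward(self, token_ids, dataset_type_id):
        # token_ids: [batch, seq_len]; dataset_type_id: [batch]
        tokens = self.token_embed(token_ids)                      # [B, L, D]
        prefix = self.prefix_embed(dataset_type_id).unsqueeze(1)  # [B, 1, D]
        # Prepend the dataset-type prefix so it can absorb dataset bias,
        # leaving the remaining tokens free to encode visual concepts.
        return self.text_transformer(torch.cat([prefix, tokens], dim=1))
```

An analogous dataset-type prefix can be prepended to the visual encoder's input, since the prefix informs both the language and visual encoders of the data type.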
We apply prefix conditioning to CLIP, which uses a language encoder and a visual encoder. At test time, we employ the prefix used for the image-caption dataset, since that dataset is expected to cover broader vocabulary words and scene types, leading to better zero-shot recognition performance.
Illustration of prefix conditioning.
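As a usage illustration, the sketch below shows how zero-shot classification could condition the text encoder on the caption-dataset prefix at test time, as described above. The `tokenizer`, `image_encoder`, and `text_encoder` objects and the `CAPTION_TYPE_ID` index are assumptions for this sketch rather than parts of a released API.

```python
import torch
import torch.nn.functional as F

CAPTION_TYPE_ID = 1  # assumed index of the image-caption dataset prefix

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, images, class_names):
    """Sketch of zero-shot classification using the caption-dataset prefix."""
    prompts = [f"a photo of a {name}" for name in class_names]
    token_ids = tokenizer(prompts)                                    # [C, L]
    type_ids = torch.full((token_ids.shape[0],), CAPTION_TYPE_ID, dtype=torch.long)

    text_feats = F.normalize(text_encoder(token_ids, type_ids), dim=-1)  # [C, D]
    image_feats = F.normalize(image_encoder(images), dim=-1)             # [B, D]

    # Cosine similarity between image and class embeddings; argmax is the prediction.
    logits = image_feats @ text_feats.t()                                # [B, C]
    return logits.argmax(dim=-1)
```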
Experimental results
We apply prefix conditioning to two types of contrastive losses, CLIP and UniCL, and evaluate their performance on zero-shot recognition tasks compared to models trained only on ImageNet-21K (IN21K) or Conceptual 12M (CC12M). The CLIP and UniCL models trained on the two datasets using prefix conditioning show large improvements in zero-shot classification accuracy.
Zero-shot classification accuracy of models trained only on IN21K or CC12M compared to CLIP and UniCL models trained on both datasets using prefix conditioning ("Ours").
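For reference, below is a minimal sketch of a CLIP-style contrastive training step in which each example carries a dataset-type id used to select its prefix. This is a simplified stand-in for the actual training setup under the assumptions above; the UniCL variant, which additionally treats examples sharing a label as positives, is only noted in a comment.

```python
import torch
import torch.nn.functional as F

def clip_style_step(image_encoder, text_encoder, images, token_ids,
                    dataset_type_ids, temperature=0.07):
    """Sketch of one contrastive step with prefix conditioning.
    dataset_type_ids selects the prefix (classification vs. caption) per example."""
    img = F.normalize(image_encoder(images), dim=-1)                      # [B, D]
    txt = F.normalize(text_encoder(token_ids, dataset_type_ids), dim=-1)  # [B, D]

    logits = img @ txt.t() / temperature                                  # [B, B]
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric image-to-text and text-to-image cross-entropy (CLIP objective).
    # A UniCL-style loss would instead mark all pairs with the same label as positives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```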
Study on test-time prefix
The table below describes how performance changes with the prefix used at test time. Using the prefix tuned for the classification dataset ("Prompt") improves performance on the classification dataset (IN-1K), while using the prefix tuned for the caption dataset ("Caption") improves performance on the other datasets (Zero-shot AVG). This analysis illustrates that fitting the prefix to the caption dataset yields better generalization of scene types and vocabulary words.
Analysis of the prefix used at test time.
Study on robustness to image distribution shift
We study robustness to image distribution shift using ImageNet variants. We see that the "Caption" prefix performs better than "Prompt" on ImageNet-R (IN-R) and ImageNet-Sketch (IN-S), but underperforms on ImageNet-V2 (IN-V2). This indicates that the "Caption" prefix achieves generalization on domains far from the classification dataset. Therefore, the optimal prefix likely differs depending on how far the test domain is from the classification dataset.
Analysis of robustness to image-level distribution shift. IN: ImageNet, IN-V2: ImageNet-V2, IN-R: ImageNet-R (art, cartoon style), IN-S: ImageNet-Sketch.
Conclusion and future work
We introduced prefix conditioning, a technique for unifying image-caption and classification datasets for better zero-shot classification. We showed that this approach leads to better zero-shot classification accuracy and that the prefix can control the bias in the language embedding. One limitation is that the prefix learned on the caption dataset is not necessarily optimal for zero-shot classification. Identifying the optimal prefix for each test dataset is an interesting direction for future work.
Acknowledgements
This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Thanks to Zizhao Zhang and Sergey Ioffe for their valuable comments.