Foundational vision-language models are built on the premise of a single pre-training step followed by adaptation to many downstream tasks. Two main and disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict whether image-text pairs are correctly matched, effectively building visual and text representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next text token in a sequence, thus learning to generate text. Depending on the downstream task, contrastive learning enables image-text and text-image retrieval tasks, such as finding the image that best matches a given description, and next-token learning enables text-generative tasks, such as image captioning and visual question answering (VQA). While both approaches have demonstrated powerful results, when a model is pre-trained contrastively, it typically does not fare well on text-generative tasks, and vice versa. Furthermore, adaptation to other tasks is often done with complex or inefficient methods. For example, in order to extend a vision-language model to videos, some models need to do inference for each video frame separately. This limits the size of the videos that can be processed to only a few frames and does not fully take advantage of the motion information available across frames.
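To make the two objectives concrete, below is a minimal sketch of the two pretraining losses in PyTorch-style code. It is illustrative rather than any released implementation; the tensor names (`image_emb`, `text_emb`, `logits`, `target_ids`) and the temperature value are assumptions.

```python
# A minimal sketch (not the authors' code) of the two pretraining objectives.
# `image_emb` / `text_emb` are assumed pooled embeddings; `logits` / `target_ids`
# come from a hypothetical text decoder.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss: matched image-text pairs should score higher than mismatched ones."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def next_token_loss(logits, target_ids):
    """Generative loss: predict each text token from the tokens before it."""
    # logits: (batch, seq_len, vocab), target_ids: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
```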
Motivated by this, we present "A Simple Architecture for Joint Learning for MultiModal Tasks", called MaMMUT, which is capable of training jointly for these competing objectives and which provides a foundation for many vision-language tasks, either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model that trains on contrastive, text-generative, and localization-aware objectives. It consists of a single image encoder and a text decoder, which allows for direct reuse of both components. Furthermore, a straightforward adaptation to video-text tasks requires using the image encoder only once and can handle many more frames than prior work. In line with recent language models (e.g., PaLM, GLaM, GPT3), our architecture uses a decoder-only text model and can be thought of as a simple extension of language models. While modest in size, our model outperforms the state of the art or achieves competitive performance on image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA.
Decoder-only model architecture
A surprising finding is that a single language decoder is sufficient for all of these tasks, which avoids the need for both the complex constructs and training procedures presented before. For example, our model (presented on the left in the figure below) consists of a single visual encoder and a single text decoder, connected via cross-attention, and trains simultaneously on both contrastive and text-generative types of losses. Comparatively, prior work either is not able to handle image-text retrieval tasks, or applies only some losses to only some parts of the model. To enable multimodal tasks and fully take advantage of the decoder-only model, we need to jointly train both contrastive losses and text-generative captioning-like losses.
MaMMUT architecture (left) is a simple construct consisting of a single vision encoder and a single text decoder. Compared to other popular vision-language models, for example, PaLI (middle) and ALBEF, CoCa (right), it trains jointly and efficiently for multiple vision-language tasks, with both contrastive and text-generative losses, fully sharing the weights between the tasks. |
Decoder two-pass learning
Decoder-only models for language learning show clear performance advantages with a smaller model size (nearly half the number of parameters). The main challenge of applying them to multimodal settings is to unify contrastive learning (which uses an unconditional, sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representations within the decoder. During the first pass, we use cross-attention and causal masking to learn the caption generation task: the text features can attend to the image features and predict the tokens in sequence. On the second pass, we disable cross-attention and causal masking to learn the contrastive task. Here the text features do not see the image features, but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Completing this two-pass approach within the same decoder allows us to accommodate both types of tasks that were previously hard to reconcile. While simple, we show that this model architecture provides a foundation for multiple multimodal tasks.
Two-pass learning with the MaMMUT decoder-only model enables both contrastive and generative learning paths using the same model. |
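The sketch below illustrates the two-pass idea with a standard transformer decoder layer. It is a simplified, hypothetical construction rather than the released MaMMUT code, and all module and argument names are our own: the same layer (and therefore the same weights) runs once with causal masking and cross-attention for the generative pass, and once bidirectionally without cross-attention to produce a text representation for the contrastive pass.

```python
# A minimal, self-contained sketch of two-pass decoding (illustrative only).
import torch
import torch.nn as nn

class TwoPassDecoderLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, image_tokens=None, causal=True):
        # Pass 1 (captioning): causal self-attention plus cross-attention to image tokens.
        # Pass 2 (contrastive): bidirectional self-attention, cross-attention skipped.
        mask = None
        if causal:
            n = x.size(1)
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        q = self.norm1(x)
        h, _ = self.self_attn(q, q, q, attn_mask=mask)
        x = x + h
        if image_tokens is not None:
            h, _ = self.cross_attn(self.norm2(x), image_tokens, image_tokens)
            x = x + h
        return x + self.mlp(self.norm3(x))


# Usage: the same layer (same weights) serves both passes.
layer = TwoPassDecoderLayer()
text = torch.randn(2, 16, 512)           # (batch, text_len, dim) text token features
image_tokens = torch.randn(2, 49, 512)   # (batch, n_patches, dim) from the vision encoder

caption_features = layer(text, image_tokens=image_tokens, causal=True)   # generative pass
contrastive_features = layer(text, image_tokens=None, causal=False)      # text-only pass
text_embedding = contrastive_features.mean(dim=1)                        # pooled for the contrastive loss
```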
Another advantage of our architecture is that, since it is trained for these disjoint tasks, it can be seamlessly applied to multiple applications, such as image-text and text-image retrieval, VQA, and captioning.
Moreover, MaMMUT easily adapts to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which required applying it multiple times. This is slow and restricts the number of frames the model can handle, typically to only 6 or 8. With MaMMUT, we use sparse video tubes for lightweight adaptation directly via the spatio-temporal information from the video. Furthermore, adapting the model to open-vocabulary detection is done simply by training it to detect bounding boxes via an object-detection head.
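As a rough illustration of why tube-based adaptation is cheap, the hypothetical sketch below (not the sparse-video-tubes implementation) uses a single strided 3D convolution to turn an entire clip into a small set of spatio-temporal tokens, so the vision backbone processes the whole clip at once rather than frame by frame. The kernel and stride sizes are illustrative assumptions.

```python
# A rough, hypothetical sketch of tube tokenization for video adaptation.
import torch
import torch.nn as nn

class SparseTubeTokenizer(nn.Module):
    def __init__(self, dim=512, tube=(4, 16, 16), stride=(4, 32, 32)):
        super().__init__()
        # Each kernel covers a (time, height, width) tube; the large stride keeps
        # the number of tokens small even for longer clips.
        self.proj = nn.Conv3d(3, dim, kernel_size=tube, stride=stride)

    def forward(self, video):
        # video: (batch, 3, frames, height, width)
        tubes = self.proj(video)                     # (batch, dim, t', h', w')
        return tubes.flatten(2).transpose(1, 2)      # (batch, n_tube_tokens, dim)


tokenizer = SparseTubeTokenizer()
clip = torch.randn(1, 3, 16, 224, 224)               # a 16-frame clip
video_tokens = tokenizer(clip)                        # (1, 4*7*7, 512) tokens for the whole clip
# These tokens can then play the same role as image tokens in the decoder's cross-attention.
```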
Results
Our model achieves excellent zero-shot results on image-text and text-image retrieval without any adaptation, outperforming all previous state-of-the-art models. The results on VQA are competitive with state-of-the-art results, which are achieved by much larger models. The PaLI model (17B parameters) and the Flamingo model (80B) have the best performance on the VQA2.0 dataset, but MaMMUT (2B) achieves the same accuracy as the 15B PaLI.
MaMMUT outperforms the state of the art (SOTA) on zero-shot image-text (I2T) and text-image (T2I) retrieval on both the MS-COCO (top) and Flickr (bottom) benchmarks. |
Performance on the VQA2.0 dataset is competitive but does not outperform large models such as Flamingo-80B and PaLI-17B. Performance is evaluated in the more challenging open-ended text generation setting. |
MaMMUT also outperforms the state of the art on VideoQA, as shown below on the MSRVTT-QA and MSVD-QA datasets. Note that we outperform much larger models such as Flamingo, which is specifically designed for image+video pre-training and is pre-trained with both image-text and video-text data.
Our results also outperform the state of the art on open-vocabulary detection fine-tuning, as shown below.
Main ingredients
We show that joint training of contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are served better by different design choices. We see that fewer cross-attention connections are better for retrieval tasks, whereas VQA tasks prefer more of them. Yet, while this shows that our model's design choices might be suboptimal for individual tasks, our model is more effective than larger or more complex models.
Ablation studies showing that fewer cross-attention connections (1-2) are better for retrieval tasks (top), while more connections favor text-generative tasks such as VQA (bottom). |
Conclusion
We present MaMMUT, a compact model with a single vision encoder and a single language decoder that jointly trains a number of conflicting objectives to reconcile contrastive and text-generative tasks. Our model also serves as a foundation for many more vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, VideoQA, video captioning, open-vocabulary detection, and VQA. We hope it can be further used for many more multimodal applications.
Acknowledgements
This work is co-authored by Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedhosseini, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zirui Wang, Yonghui Wu, Runze Li, Jie Mei, Radu Soricut, Qingqing Huang, Andy Ly, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, and Zoubin Ghahramani for their help and support.