The idea behind foundational vision-language models is that a single pretraining can be adapted to a wide variety of downstream tasks. Two training scenarios are widely used but quite distinct:
- CLIP-style contrastive learning: trains the model to predict whether an image and a text are correctly paired, producing aligned visual and text representations for the corresponding inputs. It enables image-to-text and text-to-image retrieval tasks, such as selecting the image that best matches a given description.
- Next-token prediction: learns to generate text by predicting the most likely next token in a sequence. It supports text-generative tasks such as image captioning and Visual Question Answering (VQA). (A toy sketch of both objectives follows this list.)
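To make the two objectives concrete, here is a minimal PyTorch-style sketch. It is our own illustration, not code from the paper; the tensor names, shapes, and the temperature value are assumptions. It shows a symmetric CLIP-style contrastive loss over a batch of matched image/text embeddings, and a standard next-token cross-entropy loss.

```python
# Illustrative sketch only; names and shapes are assumptions, not MaMMUT code.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched (image_i, text_i) pairs are the positives."""
    image_emb = F.normalize(image_emb, dim=-1)            # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)              # (B, D)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def next_token_loss(token_logits, token_ids, pad_id=0):
    """Next-token prediction: position t is trained to predict token t+1."""
    logits = token_logits[:, :-1, :]                      # (B, T-1, V) predictions
    targets = token_ids[:, 1:]                            # (B, T-1) shifted labels
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
```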
While both methods have shown promising results, a model pretrained with one objective does not transfer well to the other: contrastively pretrained models tend to perform poorly on text-generation tasks, and vice versa. It is also common for adaptation to new tasks to require complex or inefficient methods.
To train jointly on these conflicting objectives and to provide a foundation for numerous vision-language tasks, either directly or via lightweight adaptation, a recent Google study introduces MaMMUT, a simple architecture for joint learning across multimodal tasks. MaMMUT is a compact multimodal model of only 2B parameters that can be trained on contrastive, text-generative, and localization-aware objectives. Its simple design, just an image encoder and a text decoder, also makes it easy to reuse the two components independently.
The proposed model consists of a single vision encoder and a single text decoder connected through cross-attention, trained simultaneously on contrastive and text-generative losses. Previous work either does not address image-text retrieval tasks or applies only some of the losses to selected parts of the model. To enable the full range of multimodal tasks and fully exploit the decoder-only model, contrastive losses and caption-style text-generative losses need to be trained together.
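As a rough illustration of that design, the following toy skeleton is our own simplification with placeholder layer sizes, not the actual MaMMUT implementation: a single image encoder, a single text decoder whose blocks contain self-attention plus optional cross-attention to the image features, and separate heads for next-token prediction and for the contrastive embeddings.

```python
# Toy skeleton of the single-encoder / single-decoder design; all sizes are placeholders.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Text self-attention + optional cross-attention to image features + MLP."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, image_feats=None, attn_mask=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask)[0]
        if image_feats is not None:                        # cross-attention can be skipped
            h = self.norm2(x)
            x = x + self.cross_attn(h, image_feats, image_feats)[0]
        return x + self.mlp(self.norm3(x))

class ToyMaMMUTLikeModel(nn.Module):
    """One image encoder, one text decoder, heads for generation and retrieval."""
    def __init__(self, vocab_size=32000, dim=512, layers=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for a ViT
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(DecoderBlock(dim) for _ in range(layers))
        self.lm_head = nn.Linear(dim, vocab_size)          # next-token prediction
        self.text_proj = nn.Linear(dim, dim)               # contrastive text embedding
        self.image_proj = nn.Linear(dim, dim)              # contrastive image embedding

    def encode_image(self, images):
        return self.patchify(images).flatten(2).transpose(1, 2)       # (B, patches, dim)

    def decode_text(self, token_ids, image_feats=None, causal=True):
        x = self.token_emb(token_ids)                      # (B, T, dim)
        mask = None
        if causal:
            T = token_ids.size(1)
            mask = torch.triu(
                torch.full((T, T), float("-inf"), device=x.device), diagonal=1
            )
        for blk in self.blocks:
            x = blk(x, image_feats=image_feats, attn_mask=mask)
        return x
```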
Decoder-only models deliver considerable performance at a smaller model size (roughly half the parameters) in language modeling. One of the biggest obstacles to using them in multimodal settings is reconciling contrastive learning (which relies on an unconditional, sequence-level representation) with captioning (which maximizes the likelihood of each token conditioned on the previous tokens). The researchers propose a two-pass technique to learn these seemingly incompatible text representations jointly within the same decoder.
The first pass learns the caption-generation task using cross-attention and causal masking, so that the text features can attend to the image features and make sequential token predictions. In the second pass, cross-attention and causal masking are turned off to learn the contrastive task: the text features cannot see the image features, but they can attend bidirectionally to all text tokens at once. Thanks to this two-pass technique, both tasks, previously hard to reconcile, can be handled by the same decoder. Although the architecture is quite simple, it can serve as the basis for a wide variety of multimodal tasks.
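Continuing the toy skeleton above, a joint training step under this two-pass scheme might look like the following sketch. It is again an assumption-laden illustration, not the paper's code; in particular, mean pooling of the text tokens and the combined loss weighting are our simplifications. Pass one runs the decoder with causal masking and cross-attention for the captioning loss; pass two runs it with both switched off and computes the contrastive loss against pooled image features.

```python
import torch
import torch.nn.functional as F

def two_pass_training_step(model, images, token_ids, temperature=0.07):
    """One joint step with the ToyMaMMUTLikeModel sketched above (illustrative only)."""
    image_feats = model.encode_image(images)                # (B, patches, dim)

    # Pass 1 -- captioning: causal masking + cross-attention to image features,
    # trained with next-token cross-entropy (labels shifted by one position).
    h_cap = model.decode_text(token_ids, image_feats=image_feats, causal=True)
    logits = model.lm_head(h_cap)                           # (B, T, vocab)
    caption_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

    # Pass 2 -- contrastive: no cross-attention (text never sees the image) and
    # no causal mask, so every text token attends bidirectionally to all others.
    h_txt = model.decode_text(token_ids, image_feats=None, causal=False)
    text_emb = F.normalize(model.text_proj(h_txt.mean(dim=1)), dim=-1)      # (B, dim)
    img_emb = F.normalize(model.image_proj(image_feats.mean(dim=1)), dim=-1)
    sim = img_emb @ text_emb.t() / temperature              # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive_loss = 0.5 * (
        F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)
    )
    return caption_loss + contrastive_loss
```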
Since the architecture is trained for several distinct tasks, it can be easily plugged into many applications, including image-to-text and text-to-image retrieval, visual question answering, and captioning. For lightweight video adaptation, the researchers use sparse video tubes that give the model direct access to the video's spatiotemporal information. Transferring the model to open-vocabulary detection additionally requires training it to detect bounding boxes via an object-detection head.
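As one downstream example, text-to-image retrieval with the contrastive embeddings reduces to cosine-similarity ranking. The helper below is a hypothetical sketch that assumes embeddings have already been computed (for instance, with the toy model above).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_images(text_emb, gallery_image_embs, top_k=5):
    """Rank precomputed image embeddings against a single text query embedding."""
    text_emb = F.normalize(text_emb, dim=-1)               # (dim,)
    gallery = F.normalize(gallery_image_embs, dim=-1)      # (N, dim)
    scores = gallery @ text_emb                            # cosine similarity per image
    return torch.topk(scores, k=min(top_k, scores.numel()))  # (values, indices)
```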
Despite its compact design, MaMMUT delivers superior or competitive results in several areas, including image-to-text and text-to-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA. The team notes that their model outperforms much larger models such as Flamingo, which is tailored to image+video pretraining and is already pretrained on both image-text and video-text data.
Check out the Paper and the Google blog. Don't forget to join our 21k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.