Large language models (LLMs) trained at scale on massive text corpora exhibit compelling abilities such as producing human-like dialogue and answering complex questions. Impressive as they are, most state-of-the-art LLMs are trained on text-only data scraped from the internet; because they are never exposed to rich visual cues, they often struggle to ground concepts in the real world. As a result, most language models in use today perform poorly on tasks that require visual grounding and reasoning, and they cannot generate images at all. In this work, the researchers demonstrate how to effectively leverage the capabilities of a frozen LLM for multimodal (image and text) inputs and outputs.
The authors train the language model to learn a new [RET] token that represents an image for image-text retrieval. They also use contrastive learning to learn a linear mapping that pushes the [RET] embedding for a caption close to the visual embedding of its paired image. Only the weights of the linear layers and the [RET] token embedding are updated during training, while the rest of the model stays frozen, making the proposed approach highly memory- and compute-efficient. Once trained, the model exhibits a range of capabilities: it retains the original text-only LLM's ability to generate text while gaining new multimodal dialogue and reasoning abilities. The proposed approach is model-agnostic and can be applied to stronger or larger LLMs released in the future.
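To make the training recipe concrete, below is a minimal PyTorch sketch (not the authors' released code) of the trainable retrieval head: the frozen LLM's hidden state at the [RET] position and a frozen visual encoder's image embedding are each passed through a trainable linear layer, then aligned with a symmetric contrastive (InfoNCE) loss. The class name, dimensions, and stand-in tensors are illustrative assumptions.

```python
# Minimal sketch of FROMAGe-style training: a frozen causal LM learns a new
# [RET] token whose output embedding is mapped, via a trainable linear layer,
# into the same space as frozen image embeddings, using a contrastive loss.
import torch
import torch.nn.functional as F
from torch import nn

class FrozenRetrievalHead(nn.Module):  # illustrative name, not from the paper
    def __init__(self, lm_dim=4096, vis_dim=1024, shared_dim=256):
        super().__init__()
        # Only these parameters (plus the [RET] token embedding) are trained.
        self.text_proj = nn.Linear(lm_dim, shared_dim)
        self.image_proj = nn.Linear(vis_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ log(1/0.07)

    def forward(self, ret_hidden, image_emb):
        # ret_hidden: (B, lm_dim) LM hidden states at the [RET] position
        # image_emb:  (B, vis_dim) frozen visual-encoder embedding per image
        t = F.normalize(self.text_proj(ret_hidden), dim=-1)
        v = F.normalize(self.image_proj(image_emb), dim=-1)
        logits = self.logit_scale.exp() * t @ v.t()         # (B, B) similarities
        targets = torch.arange(t.size(0), device=t.device)  # matched pairs on diagonal
        # Symmetric InfoNCE: pull each caption's [RET] toward its paired image.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

head = FrozenRetrievalHead()
ret_hidden = torch.randn(8, 4096)  # stand-in for frozen-LM [RET] hidden states
image_emb = torch.randn(8, 1024)   # stand-in for frozen image embeddings
loss = head(ret_hidden, image_emb)
loss.backward()  # gradients flow only into the projections and logit scale
```

Because the gradient only touches the two projection layers and one token embedding, the trainable parameter count is tiny compared to the billions of frozen LLM weights, which is what makes the method cheap to train.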
The work shows that autoregressive LLMs can perform text-to-image retrieval with heightened sensitivity to the input text. One of its major contributions is the Frozen Retrieval Over Multimodal Data for Autoregressive Generation (FROMAGe) model, trained efficiently by visually grounding an LLM through image captioning and contrastive learning. Whereas previous methods require interleaved image-text data at web scale, FROMAGe develops strong few-shot multimodal capabilities from image-caption pairs alone. The method is also more accurate than previous models on long and complex free-form text. The authors demonstrate how existing capabilities of pretrained text-only LLMs, including in-context learning, input sensitivity, and dialogue generation, can be leveraged for tasks that require visual input.
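As a rough illustration of this grounding objective, the sketch below combines an image-captioning loss (next-token prediction conditioned on a projected image prefix) with the contrastive retrieval loss from the previous snippet. It assumes a Hugging Face-style causal LM that accepts `inputs_embeds` and returns `logits` and `hidden_states`; the names `image_to_prefix` and `ret_position` are hypothetical.

```python
# Sketch of a joint objective over image-caption pairs: (1) caption the image,
# (2) retrieve the image from the caption's [RET] hidden state.
import torch
import torch.nn.functional as F

def joint_loss(lm, image_to_prefix, retrieval_head, image_emb, input_ids, ret_position):
    # image_to_prefix: trainable nn.Linear from visual space to the LM input space.
    prefix = image_to_prefix(image_emb).unsqueeze(1)   # (B, 1, lm_dim) visual prefix
    tokens = lm.get_input_embeddings()(input_ids)      # (B, T, lm_dim) caption tokens
    out = lm(inputs_embeds=torch.cat([prefix, tokens], dim=1),
             output_hidden_states=True)
    # (1) Captioning loss: each position predicts the next caption token,
    # starting with the image prefix predicting the first token.
    logits = out.logits[:, :-1]                        # (B, T, vocab)
    caption_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   input_ids.reshape(-1))
    # (2) Retrieval loss: hidden state at the [RET] token vs. the paired image
    # (the +1 offsets positions for the prepended image prefix).
    batch = torch.arange(input_ids.size(0), device=input_ids.device)
    ret_hidden = out.hidden_states[-1][batch, ret_position + 1]
    return caption_loss + retrieval_head(ret_hidden, image_emb)
```

Both terms are computed from the same forward pass over an ordinary image-caption pair, which is why no web-scale interleaved corpus is needed.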
They show: (1) retrieval of contextually relevant images from sequences of interleaved images and text; (2) strong zero-shot performance on visual dialogue; and (3) increased sensitivity to discourse context for image retrieval. Their results open the door to models that can process and produce long, coherent multimodal sequences, and they highlight the capabilities of pretrained text-only LLMs on visually grounded tasks. To promote further research and development, the code and pretrained models will be released to the public soon.
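At inference time, producing an image in this setup amounts to retrieval: once the model emits [RET], its hidden state is projected and scored against a bank of precomputed image embeddings, and the best match is returned. Below is an illustrative sketch reusing the head from the first snippet; the bank size and dimensions are stand-ins.

```python
# Illustrative inference-time retrieval: score the emitted [RET] hidden state
# against every candidate image embedding and return the top match.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_image(ret_hidden, image_bank, retrieval_head, k=1):
    # ret_hidden: (lm_dim,) hidden state at the emitted [RET] token
    # image_bank: (N, vis_dim) frozen embeddings of candidate images
    q = F.normalize(retrieval_head.text_proj(ret_hidden), dim=-1)
    keys = F.normalize(retrieval_head.image_proj(image_bank), dim=-1)
    scores = keys @ q               # cosine similarity to every candidate
    return scores.topk(k).indices   # indices of the retrieved image(s)

# Example with random stand-ins for the embeddings:
head = FrozenRetrievalHead()        # trainable head from the first sketch
idx = retrieve_image(torch.randn(4096), torch.randn(10_000, 1024), head)
```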
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.