With the release of OpenAI's GPT-4, multimodality arrived in large language models. Unlike its predecessor, GPT-3.5, which powers the well-known ChatGPT and accepts only text input, GPT-4 accepts both text and images. Recently, a team of researchers at Carnegie Mellon University proposed an approach called Generating Images with Large Language Models (GILL), which extends multimodal language models to generate novel images.
The GILL method processes inputs that interleave images and text to produce text, retrieve images, and generate new images. GILL achieves this, even though the models use different text encoders, by mapping the output embedding space of a frozen text-only LLM into that of a frozen text-to-image generation model. Unlike methods that require interleaved image-and-text training data, this mapping is learned by fine-tuning a small number of parameters on image-caption pairs.
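To make the training recipe concrete, here is a minimal sketch of the idea described above, under my own assumptions rather than the released implementation: both the LLM and the image model stay frozen, and only a small mapper is fit on image-caption data so that the LLM's output representations line up with the embeddings the image model's text encoder produces for the same caption. The helper names (`frozen_llm.encode`, `frozen_t2i_text_encoder`) are hypothetical.

```python
import torch

def train_step(mapper, frozen_llm, frozen_t2i_text_encoder, captions, optimizer):
    # Frozen models: no gradients flow into them.
    with torch.no_grad():
        llm_hidden = frozen_llm.encode(captions)          # hypothetical helper
        target_emb = frozen_t2i_text_encoder(captions)    # e.g. CLIP-style text embeddings

    # The small mapper is the only trainable component.
    pred_emb = mapper(llm_hidden)
    loss = torch.nn.functional.mse_loss(pred_emb, target_emb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the mapper's parameters receive gradients, the expensive pretrained models never need to be updated, which keeps the fine-tuning lightweight.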
The team notes that the method combines a frozen text-only large language model with pretrained image encoder and decoder models. It provides a wide range of multimodal capabilities, such as image retrieval, novel image generation, and multimodal dialogue. This is achieved by mapping the modalities' embedding spaces onto one another. GILL operates on interleaved image-and-text conditioned input and produces output that is both coherent and readable.
The method introduces an efficient mapping network that connects the LLM to a text-to-image generation model, which is key to its strong image-generation performance. The mapping network translates the LLM's hidden text representations into the embedding space of the visual model, leveraging the LLM's strong text representations to produce visually consistent results.
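One plausible shape for such a mapping network, sketched below purely as an assumption (module names, layer counts, and dimensions are illustrative, not taken from the paper's code), is a small transformer in which learned query tokens cross-attend to the LLM's hidden states and emit a fixed number of embeddings in the text-to-image model's conditioning space (e.g. 77 tokens of width 768, matching Stable Diffusion's text encoder).

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, llm_dim=4096, t2i_dim=768, num_queries=77, depth=4):
        super().__init__()
        # Project LLM hidden states down to the image model's embedding width.
        self.in_proj = nn.Linear(llm_dim, t2i_dim)
        # Learned query tokens that become the conditioning embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, t2i_dim))
        layer = nn.TransformerDecoderLayer(d_model=t2i_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, img_token_hidden: torch.Tensor) -> torch.Tensor:
        # img_token_hidden: (batch, num_img_tokens, llm_dim) from the frozen LLM
        memory = self.in_proj(img_token_hidden)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        # Each learned query cross-attends to the LLM's hidden states.
        return self.decoder(tgt=queries, memory=memory)  # (batch, num_queries, t2i_dim)
```

The output tensor can then be passed to the frozen image generator in place of its usual text-encoder embeddings.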
With this approach, the model can retrieve images from a candidate dataset in addition to generating new ones. At inference time, the model decides whether to generate or retrieve an image using a learned decision module conditioned on the LLM's hidden representations. The approach is computationally efficient because the image generation model does not need to be run during training.
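An illustrative sketch of such a decision module follows; the class and function names are assumptions for exposition. It is simply a small classifier over the LLM's hidden state for the image position that chooses, at inference time, between retrieving an existing image and generating a new one.

```python
import torch
import torch.nn as nn

class RetrieveOrGenerate(nn.Module):
    """Learned decision head conditioned on the frozen LLM's hidden state."""
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.classifier = nn.Linear(llm_dim, 2)  # 0 = retrieve, 1 = generate

    def forward(self, img_token_hidden: torch.Tensor) -> torch.Tensor:
        # img_token_hidden: (batch, llm_dim) hidden representation from the frozen LLM
        return self.classifier(img_token_hidden)

# Hypothetical usage at inference time:
# logits = decision_module(llm_hidden_for_img_token)
# if logits.argmax(-1).item() == 0:
#     image = retrieve_from_dataset(llm_hidden_for_img_token)   # hypothetical helper
# else:
#     image = generate_with_mapper(llm_hidden_for_img_token)    # hypothetical helper
```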
The method outperforms baseline generation models, particularly on tasks that require longer and more sophisticated language. In particular, GILL outperforms Stable Diffusion when processing longer-form text such as dialogue and discourse. GILL also performs better at dialogue-conditioned image generation than non-LLM-based generation models, benefiting from multimodal context to produce images that better match the given text. Unlike conventional text-to-image models that handle only text input, GILL can also process arbitrarily interleaved image-and-text inputs.
In conclusion, GILL (Generating Images with Large Language Models) looks promising, as it offers a wider range of abilities than previous multimodal language models. Its ability to outperform non-LLM-based generation models on text-to-image tasks that measure context dependence makes it a powerful option for multimodal applications.
Check out the Paper and Project page.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.