You've probably heard that a picture is worth a thousand words, but can a large language model (LLM) get the picture if it's never seen pictures before?
It turns out that language models trained exclusively on text have a solid understanding of the visual world. They can write image-rendering code to generate complex scenes with intriguing objects and compositions, and even when that knowledge isn't applied correctly at first, LLMs can refine their images. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when they asked language models to self-correct their code for different images: the systems improved their simple clipart drawings with each query.
These language models gain their visual knowledge from how concepts such as shapes and colors are described on the internet, whether in language or in code. When given an instruction such as “draw a parrot in the jungle,” users prompt the LLM to draw on what it has read in those descriptions. To assess how much visual knowledge LLMs have, the CSAIL team built a “vision checkup” for LLMs: using their “Visual Aptitude Dataset,” they tested the models' abilities to draw, recognize, and self-correct these concepts. By compiling each final draft of these illustrations, the researchers trained a computer vision system that identifies the content of real photographs.
“Basically, we train a vision system without directly using any visual data,” says Tamar Rott Shaham, co-lead author of the study and a postdoc in electrical engineering and computer science (EECS) at MIT's CSAIL. “Our team prompted language models to write image-rendering code to generate data for us, and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other media, such as text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”
To build this dataset, the researchers first queried the models to generate code for different shapes, objects, and scenes. They then compiled that code to render simple digital illustrations, such as a row of bicycles, showing that LLMs understand spatial relationships well enough to draw the two-wheeled vehicles in a horizontal row. In another example, the model generated a car-shaped cake by combining two random concepts. The language model also produced a glowing light bulb, indicating its ability to create visual effects.
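To make this pipeline concrete, here is a minimal sketch of that dataset-building step. It assumes a hypothetical query_llm() helper standing in for whatever LLM client is used, and matplotlib as the rendering target; it is an illustrative outline, not the paper's actual code.

```python
# Sketch of the dataset-building step: for each visual concept, ask a text-only
# LLM for Python drawing code, execute it, and save the rendered illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder: return matplotlib drawing code (as text) for the prompt."""
    raise NotImplementedError("plug in your own LLM client here")

concepts = ["a row of bicycles", "a car-shaped cake", "a bright light bulb"]

for i, concept in enumerate(concepts):
    code = query_llm(f"Write matplotlib code that draws {concept} on the axes `ax`, "
                     "using only basic shapes.")
    fig, ax = plt.subplots()
    exec(code, {"plt": plt, "ax": ax})    # run the model-written rendering code
    fig.savefig(f"illustration_{i}.png")  # each rendered drawing becomes a data point
    plt.close(fig)
```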
“Our work shows that when you query an LLM (without any prior multimodal training) to create an image, it knows a lot more than it seems,” says Pratyusha Sharma, co-lead author, EECS PhD student, and CSAIL member. “Let's say you ask it to draw a chair. The model knows other things about this piece of furniture that it may not have rendered immediately, so users can query the model to improve the image it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a large extent.”
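The self-correction loop Sharma describes can be pictured with a short sketch, again assuming the hypothetical query_llm() helper from the previous snippet; the prompts and number of rounds are illustrative, not the study's actual protocol.

```python
# Sketch of the iterative self-correction loop: the model's previous drawing code
# is fed back to it, and it is asked to improve its own rendering on each round.
def refine_drawing(concept: str, rounds: int = 3) -> str:
    code = query_llm(f"Write matplotlib code that draws {concept}.")
    for _ in range(rounds):
        code = query_llm(
            f"Here is matplotlib code meant to draw {concept}:\n\n{code}\n\n"
            "Improve the code so the drawing looks more like the concept. "
            "Return only the revised code."
        )
    return code  # the final draft is what gets rendered and added to the dataset
```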
The researchers compiled these illustrations into a dataset, which was then used to train a computer vision system that can recognize objects within real photographs (despite never having seen one before). With this text-generated synthetic data as its only reference point, the system outperformed other vision systems trained on procedurally generated image datasets.
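One way to picture this last step is a small supervised training setup over the rendered illustrations, assuming they are stored in class-labelled folders (illustrations/&lt;concept&gt;/*.png). The paper's actual training objective and architecture differ, so this PyTorch snippet is purely an illustrative sketch of training on synthetic drawings alone.

```python
# Sketch: train an image classifier from scratch on the LLM-rendered illustrations,
# then evaluate it later on real photographs it has never seen.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("illustrations", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)  # no photo pretraining: only synthetic drawings
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```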
The CSAIL team believes that combining the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools, such as diffusion models, could also prove beneficial. Systems like Midjourney sometimes lack the finesse to consistently adjust the finer details of an image, making it difficult for them to handle requests such as reducing the number of cars shown or placing one object behind another. If an LLM sketched out the requested change for the diffusion model in advance, the resulting edit could be more satisfying.
The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the very concepts they can draw. This became clear when the models incorrectly identified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models' misidentifications.
While the models struggled to perceive these abstract representations, they demonstrated the creativity to draw the same concepts differently each time. When the researchers asked the LLMs to draw concepts like strawberries and archways multiple times, they produced images from various angles with different shapes and colors, hinting that the models might hold genuine internal images of visual concepts (rather than reciting examples they had seen before).
The CSAIL team believes this procedure could serve as a baseline for evaluating how well a generative AI model can train a computer vision system. The researchers are also looking to expand the range of tasks on which they challenge language models. As for their recent study, the MIT group notes that they don't have access to the training sets of the LLMs they used, making it difficult to further investigate the origin of their visual knowledge. In the future, they intend to explore training an even better vision model by letting the LLM work with it directly.
Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu '22, MNG '23, and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz '22, and Shivam Duggal, all affiliated with CSAIL; as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported, in part, by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They present their paper this week at the IEEE/CVF Conference on Computer Vision and Pattern Recognition.