Computers have two remarkable capabilities with regard to images: they can recognize images and generate new ones. Historically, these functions have been separate, much like the disparate skills of a chef who excels at creating dishes (generation) and a connoisseur who excels at tasting them (recognition).
Yet one cannot help but wonder: What would it take to orchestrate a harmonious union between these two distinctive capabilities? Both the chef and the connoisseur share a common understanding of how food tastes. Similarly, a unified vision system requires a deep understanding of the visual world.
Now, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have trained a system to infer the missing parts of an image, a task that requires a deep understanding of the image’s content. By successfully filling in the blanks, the system, known as the Masked Generative Encoder (MAGE), achieves two goals at once: accurately identifying images and creating new ones that bear a striking resemblance to reality.
This dual-purpose system allows for myriad potential applications, such as identifying and classifying objects within images, learning quickly from minimal examples, generating images conditioned on specific criteria such as text or class labels, and enhancing existing images.
Unlike other techniques, MAGE does not work with raw pixels. Instead, it turns images into what are called “semantic tokens,” which are compact, yet abstract, versions of an image section. Think of these tokens as mini puzzle pieces, each representing a 16×16 patch of the original image. Just as words form sentences, these tokens create an abstract version of an image that can be used for complex processing tasks, while preserving the information in the original image. Such a tokenization step can be trained within a self-supervised framework, allowing it to be pretrained on large unlabeled image datasets.
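To make the tokenization idea concrete, here is a minimal, illustrative Python sketch of vector quantization over 16×16 patches. The random codebook, the `tokenize` helper, and the codebook size are hypothetical stand-ins introduced for this example; MAGE’s actual tokenizer is a learned neural network, not a nearest-neighbor lookup against random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned tokenizer: a random codebook of patch
# "prototypes". (Purely illustrative; MAGE learns its tokenizer.)
CODEBOOK_SIZE, PATCH = 1024, 16
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH * PATCH * 3))

def tokenize(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into 16x16 patches and map each patch
    to the index of its nearest codebook entry (one 'semantic token')."""
    h, w, _ = image.shape
    tokens = []
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            # Nearest-neighbor lookup in the codebook (vector quantization).
            dists = np.linalg.norm(codebook - patch, axis=1)
            tokens.append(int(np.argmin(dists)))
    return np.array(tokens)

image = rng.normal(size=(256, 256, 3))
tokens = tokenize(image)
print(tokens.shape)  # a 256x256 image becomes a 16x16 grid of 256 tokens
```

The key property is the one the article describes: the image is no longer a grid of pixels but a short sequence of discrete symbols, which later stages can treat the way a language model treats words.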
Now, the magic begins when MAGE uses “masked token modeling.” It randomly hides some of these tokens, creating an incomplete puzzle, and then trains a neural network to fill in the gaps. In this way, it learns both to understand the patterns in an image (image recognition) and to generate new ones (image generation).
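The masking step can be sketched in a few lines. The `MASK_TOKEN` sentinel, the helper name, and the specific ratios below are illustrative assumptions rather than MAGE’s actual implementation; the point is that a single objective with a variable masking ratio spans both regimes, recognition-style training at low ratios and generation-style training at high ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_TOKEN = -1  # hypothetical sentinel id standing in for a learnable [MASK] token

def mask_tokens(tokens: np.ndarray, ratio: float):
    """Randomly replace a `ratio` fraction of tokens with the mask sentinel,
    returning the corrupted sequence and the positions to reconstruct."""
    n = len(tokens)
    n_mask = int(round(ratio * n))
    positions = rng.choice(n, size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[positions] = MASK_TOKEN
    return corrupted, positions

tokens = rng.integers(0, 1024, size=256)  # one image's token sequence

# Varying the ratio: low ratios leave most context visible (closer to a
# recognition objective); a ratio of 1.0 hides everything, so filling in
# the blanks becomes generation from scratch.
for ratio in (0.2, 0.6, 1.0):
    corrupted, positions = mask_tokens(tokens, ratio)
    print(ratio, int((corrupted == MASK_TOKEN).sum()))
```

A network trained to predict the original tokens at the masked positions is the “fill in the gaps” step the article describes.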
“A remarkable part of MAGE is its variable masking strategy during pretraining, which makes it possible to train for either task, image generation or image recognition, within the same system,” says Tianhong Li, an electrical engineering and computer science doctoral student at MIT, a CSAIL affiliate, and lead author on a paper about the research. “MAGE’s ability to work in ‘token space’ rather than ‘pixel space’ results in the generation of clear, detailed, high-quality images, as well as semantically rich image representations. Hopefully this could pave the way for advanced and integrated computer vision models.”
In addition to its ability to generate realistic images from scratch, MAGE also allows for conditional image generation. Users can specify certain criteria for the images they want MAGE to generate, and the tool will create the appropriate image. It is also capable of performing image editing tasks such as removing elements from an image while maintaining a realistic appearance.
Recognition tasks are another of MAGE’s strong points. Because it can pretrain on large unlabeled data sets, it can classify images using only its learned representations. It also excels at few-shot learning, achieving impressive results on large image data sets such as ImageNet with only a handful of labeled examples.
The validation of MAGE’s performance has been impressive. On one hand, it set new records for generating new images, surpassing previous models by a significant margin. On the other, MAGE aced recognition tasks, achieving 80.9 percent accuracy in linear probing and 71.9 percent accuracy in 10-shot classification on ImageNet (meaning it correctly identified images in 71.9 percent of cases where it had only 10 labeled examples from each class).
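Linear probing itself is simple to illustrate: freeze the pretrained encoder and fit only a linear classifier on its output features, so accuracy measures how much the representations themselves encode. The synthetic features and the softmax-regression trainer below are illustrative assumptions, not the paper’s evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen features from a pretrained encoder: in linear
# probing, encoder weights stay fixed and only a linear head is trained.
n, dim, classes = 200, 32, 4
labels = rng.integers(0, classes, size=n)
# Synthetic "encoder outputs": class-dependent means plus noise.
class_means = 3.0 * rng.normal(size=(classes, dim))
features = rng.normal(size=(n, dim)) + class_means[labels]

def train_linear_probe(X, y, classes, lr=0.1, steps=200):
    """Softmax regression on frozen features via plain gradient descent."""
    W = np.zeros((X.shape[1], classes))
    onehot = np.eye(classes)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

W = train_linear_probe(features, labels, classes)
acc = (np.argmax(features @ W, axis=1) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

Few-shot evaluation follows the same recipe, except the probe is fit with only a handful of labeled examples per class, such as the 10 per class in the ImageNet result above.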
Despite its strengths, the research team acknowledges that MAGE is a work in progress. The process of converting images to tokens inevitably leads to some loss of information, and in future work they are eager to explore ways to compress images without losing important detail. The team also intends to train MAGE on larger unlabeled data sets, which could lead to even better performance.
“It has been a long-held dream to achieve image generation and image recognition in one system. MAGE is groundbreaking research that successfully harnesses the synergy of these two tasks and achieves the state of the art for both in a single system,” says Huisheng Wang, a senior software engineer of Humans and Interactions in the Research and Machine Intelligence division at Google, who was not involved in the work. “This innovative system has a wide range of applications and has the potential to inspire much future work in the field of machine vision.”
Li co-authored the paper with Dina Katabi, the Thuan and Nicole Pham Professor in MIT’s Department of Electrical Engineering and Computer Science and a CSAIL principal investigator; Huiwen Chang, a senior research scientist at Google; Shlok Kumar Mishra, University of Maryland doctoral student and Google Research intern; Han Zhang, a senior Google research scientist; and Dilip Krishnan, a Google staff research scientist. Computational resources were provided by Google Cloud Platform and the MIT-IBM Watson Research Collaboration. The team’s research was presented at the 2023 Conference on Computer Vision and Pattern Recognition.