Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Google Research may have performed digital witchcraft, in the form of a diffusion model that can change the material properties of objects in images.
Nicknamed Alchemist, the system allows users to alter four attributes of real and AI-generated images: roughness, metallicity, albedo (the initial base color of an object), and transparency. Because it is an image-to-image diffusion model, users can feed in any photograph and then adjust each property on a continuous scale from -1 to 1 to create a new image. These photo-editing capabilities could extend to improving video game models, expanding AI capabilities in visual effects, and enriching robotic training data.
The magic behind Alchemist starts with a denoising diffusion model: in practice, the researchers used Stable Diffusion 1.5, a text-to-image model praised for its photorealistic results and editing capabilities. Previous work built on the popular model to let users make higher-level changes, such as swapping objects or altering the depth of images. In contrast, CSAIL and Google Research's method applies this model to low-level attributes, editing the finer details of an object's material properties with a unique, slider-based interface that outperforms its counterparts.
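To make the setup more concrete, here is a minimal Python sketch of the two ingredients described above: the off-the-shelf Stable Diffusion 1.5 image-to-image pipeline, loaded through the Hugging Face diffusers library, and a small container holding the four slider values on the paper's -1 to 1 scale. The MaterialEdit class, its field names, and the model identifier are illustrative assumptions; the conditioning step that feeds those scalars into Alchemist's fine-tuned network is the authors' own and is not shown here.

```python
import torch
from dataclasses import dataclass
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image


@dataclass
class MaterialEdit:
    """Hypothetical holder for Alchemist-style slider values, each in [-1, 1].

    This class is an assumption for illustration; it is not part of diffusers
    or of any published Alchemist code.
    """
    roughness: float = 0.0
    metallic: float = 0.0
    albedo: float = 0.0
    transparency: float = 0.0

    def __post_init__(self):
        for name in ("roughness", "metallic", "albedo", "transparency"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [-1, 1], got {value}")


# Load the base model the paper builds on (the exact model identifier may vary by mirror).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# An ordinary image-to-image call with the *unmodified* base model. Alchemist
# fine-tunes this network so that slider values like the ones below, rather than
# text alone, steer the output; that conditioning step is not public here.
edit = MaterialEdit(metallic=1.0)  # e.g., "make it as metallic as possible"
init_image = Image.open("rubber_duck.png").convert("RGB").resize((512, 512))
result = pipe(prompt="a rubber duck", image=init_image, strength=0.6).images[0]
result.save("edited_duck.png")
```

In the system the researchers describe, a user would move only the relevant slider and leave the others at zero, keeping the rest of the photograph intact.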
While previous diffusion systems could pull a proverbial rabbit out of a hat to produce an image, Alchemist could make that same animal appear translucent. The system could also make a rubber duck look metallic, remove the gold hue from a goldfish, and polish an old shoe. Programs like Photoshop have similar capabilities, but this model can change material properties more simply. For example, modifying the metallic appearance of a photo requires several steps in the widely used application.
“When you look at an image you've created, often the result is not exactly what you have in mind,” says Prafull Sharma, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author of a new paper describing the work. “You want to control the image as you edit it, but the existing controls in image editors can't change the materials. With Alchemist, we take advantage of the photorealism of the outputs of text-to-image models and create a slider that allows us to modify a specific property after the initial image is provided.”
Precise control
“Text-to-image generative models have enabled everyday users to generate images as easily as writing a sentence. However, controlling these models can be a challenge,” says Jun-Yan Zhu, an assistant professor at Carnegie Mellon University, who was not involved in the paper. “While generating a vase is simple, synthesizing a vase with specific material properties, such as transparency and roughness, requires users to spend hours testing different text prompts and random seeds. This can be frustrating, especially for professional users who require precision in their work. Alchemist presents a practical solution to this challenge by enabling precise control over the materials of an input image while leveraging the data-driven priors of large-scale diffusion models, and it may inspire future work to seamlessly incorporate generative models into the existing interfaces of commonly used content-creation software.”
Alchemist's design capabilities could help modify the appearance of different models in video games. Applying such a diffusion model in this area could help creators speed up their design process, refining textures to fit a level's gameplay. Additionally, Sharma and his team's project could help alter graphic design elements, videos, and film effects to improve photorealism and accurately achieve the desired material look.
The method could also refine robotic training data for tasks such as manipulation. Presented with a wider variety of textures, machines can better understand the various objects they will pick up in the real world. Alchemist can even potentially help with image classification, analyzing where a neural network fails to recognize material changes in an image.
The work of Sharma and his team surpassed similar models by faithfully editing only the requested object of interest. For example, when users asked different models to make a dolphin maximally transparent, only Alchemist achieved this feat and left the ocean floor unedited. When the researchers trained the comparable diffusion model InstructPix2Pix on the same data as their method for comparison, they found that Alchemist achieved higher accuracy scores. Additionally, a user study revealed that the MIT model was preferred and considered more photorealistic than its counterpart.
Keeping it real with synthetic data
According to the researchers, collecting real data was not practical. Instead, they trained their model on a synthetic data set, randomly editing the material attributes of 1,200 materials applied to 100 unique, publicly available 3D objects in Blender, a popular computer graphics design tool.
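Purely to illustrate that data-generation recipe, the sketch below uses Blender's built-in Python API (bpy), which runs inside Blender itself, to assign randomized Principled BSDF materials to every mesh in a scene and render the variations. The attribute names, value ranges, and the small number of variations are assumptions made for the example; the authors' actual material library and rendering pipeline are not reproduced here.

```python
import random
import bpy  # Blender's bundled Python API; this script runs inside Blender

# Principled BSDF scalar inputs to randomize (names differ across Blender versions,
# e.g. "Transmission" vs. "Transmission Weight", hence the membership check below).
SCALAR_INPUTS = ["Roughness", "Metallic", "Transmission"]


def randomize_material(obj, seed):
    """Create a fresh material with random attributes and assign it to one mesh."""
    rng = random.Random(seed)
    mat = bpy.data.materials.new(name=f"random_mat_{seed}")
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    # Random albedo (base color) plus random scalar attributes in [0, 1].
    bsdf.inputs["Base Color"].default_value = (rng.random(), rng.random(), rng.random(), 1.0)
    for name in SCALAR_INPUTS:
        if name in bsdf.inputs:
            bsdf.inputs[name].default_value = rng.random()
    obj.data.materials.clear()
    obj.data.materials.append(mat)


# Render the same scene several times, reshuffling materials on each pass
# (the paper's dataset pairs 1,200 materials with 100 objects; only 10 passes here).
meshes = [o for o in bpy.data.objects if o.type == "MESH"]
for variation in range(10):
    for i, obj in enumerate(meshes):
        randomize_material(obj, seed=variation * 1000 + i)
    bpy.context.scene.render.filepath = f"//renders/variation_{variation:04d}.png"
    bpy.ops.render.render(write_still=True)
```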
“Until now, the control of generative AI image synthesis has been limited by what text can describe,” says Frédo Durand, the Amar Bose Professor of Computing in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of CSAIL, and a senior author of the paper. “This work opens up new, finer control over visual attributes inherited from decades of computer graphics research.”
“Alchemist is the type of technique needed to make diffusion and machine learning models practical and useful for the CGI community and graphic designers,” adds Mark Matthews, senior software engineer and co-author at Google Research. “This kind of uncontrollable stochasticity can be fun for a while, but at some point, you need to do real work that obeys a creative vision.”
Sharma's latest project comes a year after he led research on Materialistic, a machine learning method that can identify similar materials in an image. This previous work demonstrated how AI models can refine their understanding of materials and, like Alchemist, was fine-tuned on a synthetic dataset of 3D Blender models.
Still, Alchemist has some limitations at the moment. The model struggles to infer lighting correctly, so it sometimes fails to follow a user's input. Sharma notes that the method also sometimes produces physically implausible transparencies. Imagine a hand partially inside a cereal box, for example: at Alchemist's maximum setting for this attribute, you would see a clear container without the fingers reaching in.
The researchers would like to expand on how such a model could improve 3D assets for scene-level graphics. Additionally, Alchemist could help infer material properties from images. According to Sharma, this type of work could unlock links between the visual and mechanical characteristics of objects in the future.
William T. Freeman, a professor in MIT EECS and a CSAIL member, is also a senior author, joining Varun Jampani and Google Research scientists Yuanzhen Li PhD '09, Xuhui Jia, and Dmitry Lagun. The work was funded, in part, by a grant from the National Science Foundation and donations from Google and Amazon. The group's work will be highlighted at CVPR in June.