Image-generation AI models have taken the field by storm in recent months. You have probably heard of Midjourney, DALL-E, ControlNet, or Stable Diffusion. These models can generate photorealistic images from a given prompt, no matter how unusual that prompt is. Do you want to see Pikachu running around Mars? Go ahead, ask one of these models to do it for you, and you will get it.
Existing diffusion models rely on large-scale training data, and large-scale really means enormous here. Stable Diffusion, for example, was trained on more than 2.5 billion image-caption pairs. So if you were planning to train your own diffusion model at home, you may want to reconsider: training these models is extremely expensive in terms of computational resources.
On the other hand, existing models are generally either unconditional or conditioned on an abstract input such as a text prompt. This means they take only one thing into account when generating an image, and it is not possible to pass in external information such as a segmentation map. Combined with their reliance on large-scale datasets, this limits the applicability of large-scale generative models in domains where we do not have a large dataset to train on.
One approach to overcoming this limitation is to fine-tune the pretrained model for a specific domain. However, this requires access to the model parameters and significant computational resources to compute gradients for the full model. Moreover, fine-tuning a full model limits applicability and scalability, since a new full-size model is required for each new domain or combination of modalities. And because of their large size, these models tend to overfit quickly on the smaller dataset they are fine-tuned on.
It is also possible to train a model from scratch, conditioned on the chosen modality. But again, this is limited by the availability of training data, and training from scratch is extremely expensive. Alternatively, people have tried to guide a pretrained model at inference time toward the desired output, using gradients from a pretrained classifier or from a CLIP network. However, this approach slows down sampling, because it adds many computations at every inference step.
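To see why guidance-based approaches slow down sampling, here is a minimal sketch of classifier guidance in the spirit of Dhariwal and Nichol, not taken from the paper discussed here. The names `eps_model`, `classifier`, `guidance_scale`, and `alpha_bar_t` are illustrative assumptions; the key point is that every sampling step needs an extra forward and backward pass through the classifier.

```python
import torch

def classifier_guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, guidance_scale=1.0):
    """Adjust a frozen model's noise prediction with classifier gradients.

    alpha_bar_t is the cumulative noise-schedule coefficient at step t.
    The classifier backward pass at every step is the inference overhead
    mentioned above.
    """
    # Plain forward pass through the frozen diffusion model.
    eps = eps_model(x_t, t)

    # Gradient of log p(y | x_t) with respect to the noisy image.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)
        log_prob = torch.log_softmax(logits, dim=-1)[torch.arange(y.shape[0]), y].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]

    # Nudge the noise estimate toward images the classifier labels as y.
    return eps - guidance_scale * ((1.0 - alpha_bar_t) ** 0.5) * grad
```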
What if we could take any existing model and adapt it to our condition without an extremely expensive process? What if we could skip the cumbersome and time-consuming process of altering the diffusion model? Could we still condition it? The answer is yes, and let me introduce it to you.
The proposed approach, multimodal conditioning modules (MCM), is a module that can be plugged into existing diffusion networks. It uses a small diffusion-like network that is trained to modulate the original diffusion network's predictions at each sampling timestep so that the generated image follows the provided conditioning.
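A rough sketch of that idea, assuming a simple convolutional modulation network and noise-prediction models: the small module sees the noisy image together with the extra condition (for example, a segmentation map) and outputs a correction to the frozen model's prediction at each step. All class and argument names here are illustrative assumptions, not the authors' actual architecture or API.

```python
import torch
import torch.nn as nn

class ConditioningModule(nn.Module):
    """Small diffusion-like network: an illustrative stand-in for the MCM idea."""
    def __init__(self, image_channels=3, cond_channels=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(image_channels + cond_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, image_channels, 3, padding=1),
        )

    def forward(self, x_t, t, cond):
        # A real module would also embed the timestep t; omitted for brevity.
        return self.net(torch.cat([x_t, cond], dim=1))


def modulated_eps(frozen_eps_model, mcm, x_t, t, cond):
    """Combine the frozen prediction with the learned modulation at step t."""
    with torch.no_grad():               # the large model is never trained
        eps = frozen_eps_model(x_t, t)
    return eps + mcm(x_t, t, cond)      # small, trainable correction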
MCM does not require the original diffusion model to be trained at all. The only training is done on the modulating network, which is small and cheap to train. This is computationally efficient and requires fewer resources than training a diffusion network from scratch or fine-tuning an existing one, since no gradients need to be computed for the large diffusion network.
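To make the "only the small module is trained" point concrete, here is a hedged training-loop sketch under the same illustrative assumptions as above: the optimizer is built over the small module's parameters only, the frozen model runs under `torch.no_grad()`, and `noise_schedule.add_noise` is a hypothetical helper for the forward diffusion step. The exact objective and schedule in the paper may differ; a standard noise-prediction loss is shown.

```python
import torch

def train_step(frozen_eps_model, mcm, optimizer, x_0, cond, noise_schedule):
    """One conditional training step; only the small module's weights are updated."""
    t = torch.randint(0, noise_schedule.num_steps, (x_0.shape[0],), device=x_0.device)
    noise = torch.randn_like(x_0)
    x_t = noise_schedule.add_noise(x_0, noise, t)     # hypothetical q(x_t | x_0) helper

    with torch.no_grad():                             # no gradients for the big model
        eps_frozen = frozen_eps_model(x_t, t)

    eps_hat = eps_frozen + mcm(x_t, t, cond)          # modulated prediction
    loss = torch.nn.functional.mse_loss(eps_hat, noise)

    optimizer.zero_grad()                             # optimizer holds mcm.parameters() only
    loss.backward()                                   # backprop only through mcm
    optimizer.step()
    return loss.item()
```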
MCM also generalizes well even when a large training dataset is not available. And it does not slow down inference: there are no gradients to compute, and the only computational overhead comes from running the small diffusion-like network.
Adding the multimodal conditioning module gives more control over image generation by enabling conditioning on additional modalities such as a segmentation map or a sketch. The main contribution of the work is the introduction of multimodal conditioning modules: a method for adapting pretrained diffusion models to conditional image synthesis without changing the original model's parameters, achieving high-quality and diverse results while being cheaper and using less memory than training from scratch or fine-tuning a large model.
Check out the Paper and Project for more details. All credit for this research goes to the researchers of this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and his M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising with deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.