Text-to-image generation

Diffusion models have been a hot topic in generative modeling for the past few years. They are capable of generating high-quality images of the concepts learned during training, but those training datasets are very large and not personalized. Users now want to customize these models: instead of generating images of a random dog somewhere, a user wants to create images of their own dog somewhere in their house. A direct solution is to retrain the model with the new information added to the dataset, but this has several limitations. First, to learn a new concept, the model needs a large amount of data, while the user may have only a few examples. Second, retraining the model every time a new concept must be learned is very inefficient. Third, learning new concepts causes the model to forget previously learned ones.
To address these limitations, a team of researchers from Carnegie Mellon University, Tsinghua University, and Adobe Research proposes a method that learns multiple new concepts from just a few examples, without completely retraining the model. They present their experiments and findings in the paper "Multi-Concept Customization of Text-to-Image Diffusion."
In this paper, the team proposes Custom Diffusion, a fine-tuning technique for text-to-image diffusion models that identifies a small subset of model weights such that fine-tuning just those weights is sufficient to model the new concepts. At the same time, it prevents catastrophic forgetting and is highly efficient, since only a very small number of parameters are trained. To further prevent forgetting, mixing of similar concepts, and overfitting to the new concept, a small set of real images with captions similar to those of the target images is selected and fed to the model during fine-tuning (Figure 2).
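To make the training setup concrete, here is a minimal sketch of one fine-tuning step under this scheme: the standard denoising objective is applied to a batch that mixes target-concept images with the regularization images. The function name and batch layout are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, text_hidden_states, latents):
    """One denoising-loss step on VAE-encoded latents.

    `latents` concatenates target-concept and regularization examples
    along the batch dimension; both share the same reconstruction loss.
    """
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps,
                encoder_hidden_states=text_hidden_states).sample
    return F.mse_loss(pred, noise)
```

Here `unet` and `scheduler` are assumed to be a diffusers `UNet2DConditionModel` and `DDPMScheduler`, and `text_hidden_states` the text-encoder output for the batch captions.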
The method is built on Stable Diffusion, and as few as four images of the new concept are used as training examples during fine-tuning.
The team found that tuning only a small set of parameters is effective and highly efficient. But how are those parameters chosen, and why does this work?
The answer comes from a simple experimental observation. The team fine-tuned the entire model on a dataset containing new concepts and carefully tracked how the weights of the different layers changed. The cross-attention layer weights changed the most, implying that they play a significant role during fine-tuning. The team took advantage of this and concluded that the model can be customized effectively by tweaking only the cross-attention layers. And it works remarkably well.
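In practice, this amounts to freezing everything except the cross-attention projections that consume the text embedding. Below is a minimal sketch assuming Hugging Face diffusers naming conventions, where the `attn2` modules are the cross-attention layers and the paper tunes their key and value projections; this is not the authors' official code.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
unet = pipe.unet

# Freeze the whole UNet first.
unet.requires_grad_(False)

# Unfreeze only the cross-attention key/value projections,
# which map the text embedding into the attention computation.
trainable = []
for name, param in unet.named_parameters():
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)
        trainable.append(param)

total = sum(p.numel() for p in unet.parameters())
tuned = sum(p.numel() for p in trainable)
print(f"Tuning {tuned / total:.2%} of UNet parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

The printed fraction makes the efficiency argument tangible: only a few percent of the UNet's weights receive gradients.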
In addition, there is another important component in this approach: the regularization dataset. Since only a few samples are used for fine-tuning, the model may overfit to the target concept and suffer from language drift. For example, training on "moongate" may cause the model to forget the associations of "moon" and "gate" with previously learned concepts. To avoid this, a set of 200 images is selected from the LAION-400M dataset whose captions are very similar to the captions of the target images. By also training on this dataset, the model learns the new concept while rehearsing previously learned concepts, so forgetting and concept mixing are avoided (Figure 5).
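The selection step itself can be sketched as a caption-similarity search: embed the target caption and candidate LAION captions with CLIP's text encoder and keep the closest matches. The checkpoint, the toy candidate list, and the helper below are illustrative assumptions rather than the paper's exact retrieval pipeline.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    """L2-normalized CLIP text features for a list of captions."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

target_caption = "photo of a moongate"  # caption describing the target concept
candidate_captions = [                  # in practice: captions from LAION-400M
    "a stone moon gate in a garden",
    "a red sports car on a highway",
    "an old archway covered in ivy",
]

target = embed([target_caption])
candidates = embed(candidate_captions)
similarity = (candidates @ target.T).squeeze(1)  # cosine similarity

# Keep the top matches (200 in the paper) as the regularization set.
k = min(200, len(candidate_captions))
top = similarity.topk(k).indices
regularization_captions = [candidate_captions[i] for i in top]
```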
The figures and tables in the paper show the results of these experiments.
This work concludes that Custom Diffusion is an efficient method for augmenting existing text-to-image models. It can quickly acquire a new concept given just a few examples and compose multiple concepts together in novel settings. The authors found that optimizing only a few model parameters is sufficient to represent these new concepts while remaining memory- and computationally efficient.
However, the fine-tuned model inherits some limitations of the pretrained model. As shown in Figure 11, difficult compositions, for example a stuffed turtle and a teddy bear, remain a challenge. Furthermore, composing three or more concepts is also troublesome. Addressing these limitations may be a future direction for research in this field.
Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast who is passionate about research and the latest advances in Deep Learning, Computer Vision, and related fields.