Diffusion models have proven highly successful at producing high-quality images from text prompts. This text-to-image (T2I) generation paradigm has been applied to several downstream tasks, including depth-conditioned image generation and object detection/segmentation. Two families of text-conditioned diffusion models are central to these advances: unCLIP models and latent diffusion models (LDMs), the best known of which is Stable Diffusion. LDMs are popular in research because they are freely available as open-source software, whereas unCLIP models have received comparatively little attention. The basic goal of both families is to train diffusion models conditioned on text prompts.
Unlike unCLIP models, which comprise a text-to-image prior and a diffusion image decoder, an LDM contains a single text-to-image diffusion model. Both families operate in a quantized latent space of the image. The research team focuses on unCLIP models because they often outperform other SOTA models on composition benchmarks such as T2I-CompBench and HRS-Benchmark. These T2I models tend to have many parameters and require high-quality image-text pairs for training. Compared to LDMs, unCLIP models such as DALL-E-2, Karlo, and Kandinsky have a substantially larger total model size (≥ 2B) because of their prior module, which alone has around 1 billion parameters.
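A minimal sketch of this two-stage unCLIP pipeline may make the contrast with an LDM's single model concrete. The module names (`text_encoder`, `prior`, `decoder`) and their signatures are hypothetical stand-ins, not the actual APIs of DALL-E-2, Karlo, or Kandinsky:

```python
import torch

@torch.no_grad()
def generate_unclip(prompt, text_encoder, prior, decoder):
    """Hypothetical two-stage unCLIP generation: prior, then decoder."""
    z_y = text_encoder(prompt)   # text embedding z_y
    z_x = prior(z_y)             # stage 1: predict image embedding z_x from z_y
    image = decoder(z_x, z_y)    # stage 2: diffusion decoder renders the image
    return image
```

An LDM, by contrast, would collapse both stages into one diffusion model conditioned directly on the text embedding.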
The training data for these unCLIP models comprise 250M, 115M, and 177M image-text pairs, respectively. Two important questions therefore remain: 1) Does the text-to-image prior improve SOTA performance on text compositions? 2) Or is scaling up the model size the crucial factor? By improving parameter and data efficiency, the research team aims to deepen the understanding of T2I priors and deliver significant improvements over current formulations. As suggested by previous research, T2I priors are themselves diffusion models, designed to directly estimate the noiseless image embedding at every timestep of the diffusion process. The research team conducted an empirical study of this prior diffusion process.
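To make the setup under examination concrete, here is a hedged sketch of one training step for such a diffusion prior: at a random timestep t, the prior sees a noised image embedding and the text embedding and is trained to reconstruct the clean embedding. The noise schedule and module interface are assumptions, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def diffusion_prior_loss(prior, z_x, z_y, alphas_cumprod):
    """Illustrative training step for a diffusion T2I prior.

    z_x: clean image embeddings, shape (B, D)
    z_y: text embeddings, shape (B, D)
    alphas_cumprod: cumulative noise schedule, shape (T,)
    """
    B = z_x.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=z_x.device)
    a_t = alphas_cumprod[t].unsqueeze(-1)                # (B, 1)
    noise = torch.randn_like(z_x)
    z_t = a_t.sqrt() * z_x + (1.0 - a_t).sqrt() * noise  # forward diffusion
    z_pred = prior(z_t, t, z_y)                          # predict the *noiseless* z_x
    return F.mse_loss(z_pred, z_x)
```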
The research team found that the diffusion process marginally degrades performance and has no measurable effect on producing correct images. Moreover, because diffusion models converge slowly, training them demands substantial GPU hours or days. As a result, a non-diffusion model serves as a surrogate in this study. Because classifier-free guidance is unavailable without diffusion, this choice may limit compositional possibilities, but it greatly improves parameter efficiency and reduces data dependence.
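The non-diffusion surrogate drops the timestep and the noising step entirely: the prior maps the text embedding to an image embedding in a single forward pass, trained with a plain projection (MSE) objective. A sketch under the same hypothetical interface as above:

```python
import torch.nn.functional as F

def projection_loss(prior, z_x, z_y):
    # Single forward pass: no timestep, no noise, no iterative sampling.
    z_pred = prior(z_y)             # directly estimate the image embedding
    return F.mse_loss(z_pred, z_x)  # standard projection objective
```

This is why training is so much cheaper: each example costs one forward pass instead of a denoising trajectory, at the price of losing classifier-free guidance.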
In this study, the Arizona State University research team presents ECLIPSE, a novel contrastive learning technique that improves the non-diffusion T2I prior and overcomes the drawbacks above. The research team improves on the conventional approach of producing an image embedding from the given text embedding by optimizing the Evidence Lower Bound (ELBO). The research team proposes using the semantic alignment (between text and image) of pre-trained vision-language models to supervise the prior's training. Using ECLIPSE, the research team trains compact (97% smaller) non-diffusion prior models (with 33 million parameters) on a relatively small fraction of the image-text pairs (0.34%–8.69%). The research team trained ECLIPSE priors for variants of the unCLIP diffusion image decoders (Karlo and Kandinsky). Priors trained with ECLIPSE outperform their billion-parameter counterparts as well as baseline prior-learning algorithms. These findings suggest a possible path for T2I generative models that improve compositionality without requiring many parameters or much data.
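As a rough illustration of how such a contrastive term can be combined with the projection objective, the sketch below adds a CLIP-style (InfoNCE) loss that pulls each predicted image embedding toward its own text embedding and away from the other texts in the batch. The weighting `lam`, the temperature, and the exact formulation are assumptions, not the paper's verbatim loss:

```python
import torch
import torch.nn.functional as F

def eclipse_style_loss(prior, z_x, z_y, temperature=0.07, lam=0.5):
    z_pred = prior(z_y)             # non-diffusion prior, one forward pass
    proj = F.mse_loss(z_pred, z_x)  # projection term (as in the baseline)

    # CLIP-style contrastive term: row i of `logits` scores predicted image
    # embedding i against every text embedding in the batch; the correct
    # match lies on the diagonal.
    zp = F.normalize(z_pred, dim=-1)
    zy = F.normalize(z_y, dim=-1)
    logits = zp @ zy.t() / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(zp.shape[0], device=zp.device)
    contrastive = F.cross_entropy(logits, targets)

    return proj + lam * contrastive
```

The contrastive term is what lets the pre-trained vision-language alignment supervise the prior, which is plausibly how such a small model can stay competitive on composition.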
As shown in Fig. 1, by improving the T2I prior within unCLIP families, the total parameter and data requirements decrease significantly while achieving SOTA performance against models of comparable size. Contributions: 1) Within the unCLIP framework, the research team presents ECLIPSE, the first effort to use contrastive learning for text-to-image priors. 2) Through comprehensive experimentation, the research team demonstrates ECLIPSE's superiority over baseline priors in resource-constrained settings. 3) Notably, ECLIPSE priors require only 2.8% of the training data and 3.3% of the model parameters to match the performance of larger models. 4) The research team also examines the shortcomings of current T2I diffusion priors and provides empirical observations.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.