Text-to-image generation models are currently revolutionizing artificial intelligence (AI) and the way creative images are synthesized. They rely on powerful language models to understand an input text prompt and convert it into tokens, which are then mapped to multidimensional embeddings that capture the essential information contained in the text.
Vision-language models like CLIP use these embeddings with a contrastive learning objective for multimodal retrieval tasks, which involve finding semantically relevant matches between text and images. CLIP exploits vast datasets of image-caption pairs to learn the relationships between images and their text captions. Well-established diffusion models, such as Stable Diffusion, DALL-E, or Midjourney, use CLIP for semantic awareness in the diffusion process, the sequence of steps that first adds noise to an image and then removes it to recover an increasingly accurate result.
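As a concrete illustration of this text-image matching, the snippet below scores a set of captions against an image with a pretrained CLIP model. It is a minimal sketch using the Hugging Face `transformers` API; the checkpoint name and the example image and captions are illustrative placeholders.

```python
# A minimal sketch of CLIP-style text-image matching via Hugging Face
# `transformers`; the image file and captions are hypothetical examples.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical example image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between the image and each caption; higher = better match.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # the "cat" caption should score highest for a cat photo
```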
From these complex models, simpler but still powerful solutions can be derived through Score Distillation Sampling (SDS). SDS uses a large pretrained diffusion model as a guide: the scores (noise predictions) it assigns to perturbed versions of an image provide gradients for optimizing that image, or a smaller generator, toward the text prompt.
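To make this concrete, here is a minimal sketch of the SDS update in the spirit of the DreamFusion formulation, assuming a frozen pretrained denoiser `eps_model` and a standard noise schedule `alphas_cumprod`; all names and signatures are illustrative, not the authors' code.

```python
# A hedged sketch of the Score Distillation Sampling (SDS) gradient.
# `eps_model` stands in for a pretrained diffusion denoiser eps(z_t, t, y).
import torch

def sds_grad(eps_model, x, text_emb, alphas_cumprod):
    """One SDS update direction for the image tensor `x`."""
    t = torch.randint(20, 980, (1,))              # random diffusion timestep
    a = alphas_cumprod[t].view(-1, 1, 1, 1)       # noise-schedule term for t
    noise = torch.randn_like(x)
    z_t = a.sqrt() * x + (1 - a).sqrt() * noise   # forward-diffuse x to step t

    with torch.no_grad():                          # the score network is frozen
        eps_pred = eps_model(z_t, t, text_emb)     # text-conditioned noise prediction

    w = 1 - a                                      # a common weighting choice w(t)
    # SDS uses (eps_pred - noise), skipping the denoiser's Jacobian, as the
    # gradient w.r.t. x; it points x toward higher density under the prompt.
    return w * (eps_pred - noise)
```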
Although very powerful and effective at simplifying complex diffusion models, SDS suffers from synthesis artifacts. One of its main problems is mode collapse, its tendency to converge towards specific modes of the distribution. This often leads to blurry outputs that capture only the items explicitly described in the prompt, as in Figure 2.
To address this, a new score distillation technique has been proposed, called Delta Denoising Score (DDS). The name comes from the way the distillation score is computed. Unlike SDS, which queries the generative model with a single image-text pair, DDS adds a query with a reference pair, in which the text matches the image content. The score is the difference, or delta, between the results of the two queries.
The basic form of DDS requires two image-text pairs: one is the reference, which does not change during optimization, and the other is the optimization target, which must match the target text prompt. DDS yields effective gradients that act on the regions of the image to be edited and leave the rest untouched.
In DDS, the source image and its text caption are used to estimate the noisy, undesirable gradient directions introduced by SDS. When editing an image in detail or in part with a new text description, the reference estimate helps obtain a cleaner gradient direction for updating the image.
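Putting the two queries together, the sketch below shows one way the DDS delta could be computed, reusing the SDS-style machinery from above with a shared noise sample and timestep; again, `eps_model` and the helper names are assumptions, not the paper's implementation.

```python
# A hedged sketch of the Delta Denoising Score: two SDS-style queries sharing
# the same noise and timestep, whose difference is the update direction.
import torch

def dds_grad(eps_model, x_tgt, y_tgt, x_ref, y_ref, alphas_cumprod):
    """DDS gradient for the target image `x_tgt` given a fixed reference pair."""
    t = torch.randint(20, 980, (1,))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x_tgt)                # shared noise for both queries

    z_tgt = a.sqrt() * x_tgt + (1 - a).sqrt() * noise
    z_ref = a.sqrt() * x_ref + (1 - a).sqrt() * noise

    with torch.no_grad():
        eps_tgt = eps_model(z_tgt, t, y_tgt)       # query 1: target image + target text
        eps_ref = eps_model(z_ref, t, y_ref)       # query 2: matched reference pair

    # The reference branch estimates the noisy, prompt-agnostic component of
    # the SDS gradient; subtracting it leaves a cleaner editing direction.
    return eps_tgt - eps_ref
```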
Additionally, DDS can modify images by changing their textual descriptions alone, without computing or providing a visual mask. Furthermore, it allows an image-to-image model to be trained without paired training data, resulting in a zero-shot image translation method. According to the authors, this zero-shot training technique can be used for both single- and multi-task image translation, and the source distribution can include both authentic and synthetically generated images.
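For instance, a hypothetical training loop for this zero-shot setting could use the `dds_grad` sketch above as the only supervision signal for an editing network, so no paired source/target images are required; `editor`, `loader`, and the prompt tensors below are placeholders, not the authors' setup.

```python
# A hypothetical zero-shot image-to-image training loop driven by DDS
# gradients alone; `eps_model`, `alphas_cumprod`, and `dds_grad` are the
# sketches from above, and `editor`/`loader` are illustrative stand-ins.
import torch

editor = torch.nn.Conv2d(3, 3, 3, padding=1)       # stand-in for a real U-Net editor
opt = torch.optim.Adam(editor.parameters(), lr=1e-4)

for x_src, y_src, y_tgt in loader:                 # unpaired images + prompt pairs
    x_out = x_src + editor(x_src)                  # predicted edit (residual form)
    g = dds_grad(eps_model, x_out, y_tgt, x_src, y_src, alphas_cumprod)

    opt.zero_grad()
    x_out.backward(gradient=g)                     # inject the DDS direction via chain rule
    opt.step()
```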
Below is an image comparing the performance of DDS with state-of-the-art approaches for image-to-image translation.
This was a summary of Delta Denoising Score, a new AI technique for faithful, clean, and detailed image-to-image and text-to-image synthesis. If you are interested, you can learn more about this technique at the links below.
Check out the Paper and the project page.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.