Large-scale text-to-image (T2I) diffusion models, which generate images conditioned on a given text prompt, have developed rapidly thanks to the availability of large amounts of training data and enormous computing power. However, this generative capability is diverse and hard to control, which makes it difficult to design prompts that produce images matching what the user has in mind, and harder still to further edit generated or existing images.
Image editing has more varied requirements than image generation. GAN-based methods have found wide application in image editing because their latent space is compact and easy to manipulate, while diffusion models are more stable and generate higher-quality results than GANs.
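As a rough illustration of why a compact GAN latent space lends itself to editing, a single low-dimensional latent code can be shifted along a learned semantic direction and re-decoded. The sketch below is a minimal, hedged example; the `generator` and `direction` inputs are hypothetical placeholders standing in for a pretrained GAN and an attribute-discovery method.

```python
import torch

def edit_with_latent_direction(generator, w, direction, alpha=2.0):
    """Shift a GAN latent code along a semantic direction (e.g. a 'smile'
    direction) and regenerate the image. Both `generator` and `direction`
    are assumed to come from a pretrained model; this is only a sketch of
    the general latent-space-editing idea, not any specific method."""
    w_edited = w + alpha * direction      # move the latent along the edit direction
    with torch.no_grad():
        image = generator(w_edited)       # decode the edited latent back to an image
    return image
```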
A new research paper from Peking University and ARC Lab, Tencent PCG, asks whether diffusion models can offer the same drag-style editing capabilities.
The fundamental difficulty is that this requires a compact and editable latent space. Many diffusion-based image editing approaches build on the similarity between intermediate text and image features: studies find a strong local correspondence between word features and object features on the cross-attention map, which can be exploited for editing.
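To make that cross-attention locality concrete, here is a minimal, hedged sketch of how the attention column of a single text token can act as a rough spatial mask over image features. The tensor shapes and the normalization are illustrative assumptions, not the exact computation used by any particular editing method.

```python
import torch

def cross_attention_map(image_feats, text_feats, word_index):
    """Toy cross-attention between flattened image features (B, HW, C) and
    text token features (B, T, C). The attention column of one word roughly
    localizes the corresponding object, which editing methods can use as a
    spatial mask."""
    scale = image_feats.shape[-1] ** -0.5
    # Attention of each image location over the text tokens: (B, HW, T)
    attn = torch.softmax(image_feats @ text_feats.transpose(1, 2) * scale, dim=-1)
    word_map = attn[..., word_index]                                  # (B, HW) saliency for one token
    return word_map / (word_map.max(dim=-1, keepdim=True).values + 1e-8)  # normalize to [0, 1]
```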
Beyond the strong correlation between text features and intermediate image features in the large-scale T2I diffusion generation process, there is also a strong correspondence among the intermediate image features themselves. This property has been investigated in DIFT, which showed that the correspondence between these features is strong enough to directly match similar regions across images. Building on this high similarity between image features, the team leverages it to drive image editing.
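The sketch below shows one way such feature correspondence can be used for point matching: given intermediate diffusion features for two images, the target location most similar (by cosine similarity) to a source point is selected. The feature extractor itself (a pretrained diffusion UNet queried at a chosen timestep, as in DIFT) is assumed and not shown.

```python
import torch
import torch.nn.functional as F

def match_point(src_feats, tgt_feats, src_xy):
    """DIFT-style correspondence sketch: `src_feats` and `tgt_feats` are
    intermediate diffusion features of shape (C, H, W) for two images.
    Returns the target (x, y) whose feature best matches the source point."""
    x, y = src_xy
    query = F.normalize(src_feats[:, y, x], dim=0)      # (C,) unit feature at the source point
    tgt = F.normalize(tgt_feats.flatten(1), dim=0)      # (C, H*W) unit features of all target locations
    sim = query @ tgt                                    # (H*W,) cosine similarities
    idx = sim.argmax().item()
    h, w = tgt_feats.shape[1:]
    return idx % w, idx // w                             # best-matching (x, y) in the target image
```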
To adapt the intermediate representations of the diffusion model, the researchers design a classifier-guidance-based strategy called DragonDiffusion that converts editing signals into gradients via feature correspondence. The proposed approach uses two groups of features (i.e., guidance features and generation features) at different stages of the diffusion process. Using the strong correspondence between image features as guidance, they revise and refine the generation features based on the guidance features. This same correspondence also helps preserve content consistency between the edited image and the original.
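To make the classifier-guidance idea concrete, the sketch below shows one hedged way such a gradient could be formed: an energy built from feature similarity between guidance and generation features is differentiated with respect to the noisy latent and used to steer the sampling step. The hook `extract_feats` and the cosine-based energy are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def guidance_gradient(z_t, extract_feats, guide_feats, edit_mask, strength=1.0):
    """Classifier-guidance-style editing sketch (in the spirit of DragonDiffusion,
    not the authors' exact formulation). `extract_feats` stands in for a diffusion
    UNet hook that returns intermediate features (B, C, H, W) for the noisy latent
    z_t; `guide_feats` are the target features; `edit_mask` is a boolean (B, H, W)
    region where the edit should apply."""
    z_t = z_t.detach().requires_grad_(True)
    gen_feats = extract_feats(z_t)                              # features of the current latent
    sim = F.cosine_similarity(gen_feats, guide_feats, dim=1)    # (B, H, W) per-pixel similarity
    energy = (1.0 - sim)[edit_mask].mean()                      # lower energy = better match in the edit region
    grad = torch.autograd.grad(energy, z_t)[0]
    return -strength * grad   # add to the denoising update to push features toward the guidance
```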
The researchers also note that a concurrent work, DragDiffusion, investigates the same topic. DragDiffusion uses LoRA to preserve the original content and performs editing by optimizing a single intermediate step of the diffusion procedure. In contrast, the method proposed in this paper requires no model fine-tuning or training: it is based on classifier guidance, with all editing and content-consistency signals coming directly from the image.
DragonDiffusion derives all editing and content-preservation signals from the image itself. Without additional fine-tuning or training, the T2I generation capability of diffusion models can thus be transferred directly to image editing applications.
Extensive experiments show that the proposed DragonDiffusion can perform a wide range of fine-grained image editing tasks, such as resizing and repositioning objects, changing their appearance, and dragging their content.
Check out the Paper and GitHub link. Don’t forget to join our 25k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
Dhanshree Shenwai is a Computer Engineer with solid experience in FinTech companies covering the Finance, Cards & Payments, and Banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today’s changing world, making everyone’s life easier.