Image synthesis techniques have boomed in recent years, attracting great interest from academia and industry. Text-to-image generation models such as Stable Diffusion (SD) are among the most widely used developments in this field. Although these models have demonstrated notable capabilities, they can currently produce images only up to a maximum resolution of 1024 × 1024 pixels, which is insufficient for high-resolution applications such as advertising.
Problems arise when generating images larger than the training resolution, chiefly repeated objects and distorted object structures. For example, when a Stable Diffusion model trained on 512 × 512 images is used to generate larger images, such as 1024 × 1024, object duplication becomes more severe as the image size increases.
In the generated images, these problems show up primarily as object duplication and incorrect object structures. Existing methods for higher-resolution generation, such as those based on joint-diffusion techniques and attention mechanisms, struggle to address them adequately. By examining the structural elements of the U-Net architecture in diffusion models, the researchers identified the crucial cause: the limited receptive field of the convolutional kernels. In essence, problems like object repetition arise because the model's convolution operations can see and interpret only a limited portion of the image at a time.
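To make the receptive-field argument concrete, the following minimal sketch computes the receptive field of a stack of convolutions using the standard recurrence, then shows how small a share of the image that fixed field covers as the resolution grows. The layer stack `toy_stack` is a hypothetical example chosen for illustration; it is not the Stable Diffusion U-Net.

```python
from typing import List, Tuple

# Each layer is described as (kernel_size, stride, dilation).
Layer = Tuple[int, int, int]


def receptive_field(layers: List[Layer]) -> int:
    """Receptive field, in input pixels along one axis, of stacked convolutions."""
    rf = 1    # before any layer, one output position sees one input pixel
    jump = 1  # spacing, in input pixels, between adjacent positions at this depth
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical toy stack: four 3x3 convolutions, the second with stride 2.
toy_stack: List[Layer] = [(3, 1, 1), (3, 2, 1), (3, 1, 1), (3, 1, 1)]
base_rf = receptive_field(toy_stack)

# The field is fixed by the architecture, so its share of the image shrinks
# as the generated resolution grows beyond the training resolution.
for size in (512, 1024, 2048):
    print(f"{size}px wide image: field covers {base_rf / size:.2%} of the width")
```

Because the field is a property of the layers rather than of the input, each convolution sees proportionally less context in a larger image, which is consistent with the repetition behavior described above.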
A team of researchers has proposed ScaleCrafter for higher-resolution visual generation at inference time. It uses re-dilation, a simple but effective technique that lets models handle higher resolutions and variable aspect ratios by dynamically adjusting the convolutional receptive field during the image generation process, which improves the consistency and quality of the generated images. The work presents two additional advances: dispersed convolution and noise-damped classifier-free guidance. With these, the model can produce ultra-high-resolution images of up to 4096 × 4096 pixels. The method requires no additional training or optimization steps, making it a practical solution to the structural and repetition problems of high-resolution image synthesis.
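The idea of enlarging the receptive field without retraining can be illustrated with an ordinary dilated convolution that reuses the same kernel weights at a larger spacing. The sketch below is a generic illustration in pure Python, not the authors' implementation: the function names, the all-ones kernel, and the 7 × 7 test image are assumptions, and it does not show which layers ScaleCrafter dilates or at which denoising steps.

```python
from typing import List

Matrix = List[List[float]]


def dilated_conv2d(image: Matrix, kernel: Matrix, dilation: int = 1) -> Matrix:
    """Zero-padded 'same' 2D cross-correlation with a square, odd-sized kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Dilation spreads the kernel taps apart; weights are unchanged.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[i][j] * image[yy][xx]
            out[y][x] = total
    return out


def kernel_span(kernel_size: int, dilation: int) -> int:
    """Width in pixels covered by a dilated kernel along one axis."""
    return (kernel_size - 1) * dilation + 1


# Stand-in for pretrained weights (hypothetical all-ones 3x3 kernel).
kernel: Matrix = [[1.0] * 3 for _ in range(3)]

# A 7x7 image with a single bright pixel two columns right of the center.
image: Matrix = [[0.0] * 7 for _ in range(7)]
image[3][5] = 1.0

out_d1 = dilated_conv2d(image, kernel, dilation=1)
out_d2 = dilated_conv2d(image, kernel, dilation=2)

print("span at dilation 1:", kernel_span(3, 1), "center response:", out_d1[3][3])
print("span at dilation 2:", kernel_span(3, 2), "center response:", out_d2[3][3])
```

With dilation 1 the center position cannot see the bright pixel, while with dilation 2 the same nine weights reach it. That is the sense in which dilation widens the field without adding or retraining parameters.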
Extensive experiments demonstrated that the proposed method successfully addresses object repetition and delivers state-of-the-art results in higher-resolution image generation, excelling in particular at rendering complex texture details. The work also shows that diffusion models already trained on low-resolution images can be used to generate high-resolution images without extensive retraining, which could guide future work on ultra-high-resolution image and video synthesis.
The main contributions are summarized as follows.
- The team found that the main cause of object repetition is the limited receptive field of the convolution operations, rather than the number of attention tokens.
- Based on this finding, the team proposed a re-dilation approach that dynamically enlarges the convolutional receptive field during inference, addressing the root of the problem.
- Two further strategies were presented, dispersed convolution and noise-damped classifier-free guidance, both designed specifically for ultra-high-resolution image generation.
- The method was extensively evaluated on a variety of diffusion models, including different versions of Stable Diffusion, and was also applied to a text-to-video model. The tests cover a wide range of aspect ratios and image resolutions, showing the effectiveness of the method in addressing object repetition and improving high-resolution image synthesis.
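For readers unfamiliar with the guidance technique named in the third contribution, the sketch below shows only the standard classifier-free guidance rule, which extrapolates from the unconditional noise prediction toward the text-conditioned one. How the noise-damped variant modifies this rule is not described in this article and is not reproduced here; the function name and sample values are illustrative assumptions.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Standard classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for a two-element latent.
eps_uncond = [0.0, 1.0]  # prediction without the text prompt
eps_cond = [1.0, 3.0]    # prediction with the text prompt

print(classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5))
```

A scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one. Larger scales push the result further along the direction suggested by the prompt.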
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills and a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.