Stability AI, together with its DeepFloyd AI research lab, has released a research version of its latest technology, DeepFloyd IF. This cascaded pixel diffusion model is designed to generate high-quality images from text inputs. The model is available under a non-commercial, research-permissible license, allowing research labs to explore and experiment with advanced text-to-image generation methods. The release aligns with Stability AI’s commitment to share innovative technologies with the broader research community. The company plans to release the DeepFloyd IF model as fully open source over time.
The recently released DeepFloyd IF model has several notable features. First, it uses the T5-XXL-1.1 language model as a text encoder to interpret text prompts. The model also employs cross-attention layers to better align the text prompt with the generated image. One of its most prominent features is the ability to follow text descriptions accurately when generating images with multiple objects in different spatial relationships, a task that has been challenging for other text-to-image models. Another noteworthy feature is the high degree of photorealism in the generated images, reflected in the model’s zero-shot FID score of 6.66 on the COCO dataset (a lower FID is better). The DeepFloyd IF model can also output images with non-standard aspect ratios, such as portrait or landscape orientations, in addition to the standard square aspect.
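To make the cross-attention idea concrete, here is a minimal single-head sketch in plain Python. It is not DeepFloyd IF's implementation: the function names, the toy vectors, and the absence of learned projection matrices are all simplifying assumptions. It only shows how each image position can weight the text tokens and pull in information from them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Single-head scaled dot-product cross-attention.

    image_queries: one query vector per image position.
    text_keys / text_values: one key and one value vector per text token.
    Returns (outputs, weights); weights[i][j] is how strongly image
    position i attends to text token j.
    """
    dim = len(text_keys[0])
    outputs, weights = [], []
    for query in image_queries:
        # Similarity between this image position and every text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_keys
        ]
        attn = softmax(scores)
        # Weighted average of the text value vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(attn, text_values))
            for d in range(len(text_values[0]))
        ]
        weights.append(attn)
        outputs.append(mixed)
    return outputs, weights


# Toy data (made up): two image positions, three text tokens, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = cross_attention(queries, keys, values)
```

In this toy setup the first image position attends mostly to the first token and the second position to the second token, which is the kind of text-to-image alignment the layers are meant to provide. A real model would derive the queries, keys, and values from learned projections of image features and text-encoder outputs.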
In addition to text-to-image generation, the DeepFloyd IF model offers zero-shot image-to-image translation. This is accomplished by resizing the original image to 64 pixels, adding noise via forward diffusion, and then using backward diffusion with a new prompt to remove the noise. The style can be changed further through the super-resolution modules via a prompt text description. This approach allows you to modify the style, patterns, and details in the output image while maintaining the basic shape of the source image, without the need for fine-tuning.
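The noising half of this procedure can be sketched in a few lines of plain Python. The sketch assumes a DDPM-style linear beta schedule with 1000 steps; the article does not state which schedule or step count DeepFloyd IF uses, so `alpha_bar`, `add_noise`, and their default values are illustrative only. The denoising half needs the trained model and is left out.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) over the first t steps (assumed linear schedule)."""
    product = 1.0
    for step in range(t):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(image, t, rng, num_steps=1000):
    """Sample x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, where a = alpha_bar(t)."""
    a = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [
        [signal * pixel + noise * rng.gauss(0.0, 1.0) for pixel in row]
        for row in image
    ]


rng = random.Random(0)
# Stand-in for a source image already resized to 64x64, with values in [-1, 1].
source = [[(x + y) / 63.0 - 1.0 for x in range(64)] for y in range(64)]
noisy = add_noise(source, t=600, rng=rng)
```

The choice of `t` controls the trade-off described above: a small `t` keeps most of the source image, while a large `t` leaves mostly noise and gives the new prompt more freedom during denoising.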
The DeepFloyd IF model works in three stages to generate high-quality images from text prompts. In the first stage, a frozen T5-XXL language model converts the text prompt into a text representation. In the second stage, a base diffusion model turns that text representation into a 64 × 64 image, which is then upscaled to 256 × 256 using text-conditional super-resolution models. In the third stage, a final model upscales the image to a sharp, high-quality 1024 × 1024 resolution. The IF release includes several versions of the base and super-resolution models that differ in parameter count. Although the third-stage model is not yet available, alternative upscalers such as the Stable Diffusion x4 Upscaler can be used in its place.
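The resolution arithmetic of the cascade can be checked with a short sketch. The `Stage` class, the stage labels, and `cascade_sizes` are hypothetical names for illustration; only the 64, 256, and 1024 pixel resolutions come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    scale: int  # factor applied to each side of the incoming image


# Labels are illustrative; the scale factors follow the resolutions in the text.
CASCADE = [
    Stage("super-resolution to 256 px", 4),
    Stage("final upscale to 1024 px", 4),
]


def cascade_sizes(base_width=64, base_height=64, stages=CASCADE):
    """Return the (width, height) after the base model and after each upscaling stage."""
    sizes = [(base_width, base_height)]
    for stage in stages:
        width, height = sizes[-1]
        sizes.append((width * stage.scale, height * stage.scale))
    return sizes


sizes = cascade_sizes()
pixel_counts = [w * h for w, h in sizes]
```

Here `sizes` is `[(64, 64), (256, 256), (1024, 1024)]`, so each upscaling step multiplies the pixel count by 16. The base diffusion model works on 4,096 pixels instead of 1,048,576, which is the main appeal of generating small and upscaling afterwards.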
The DeepFloyd IF model was trained on a custom high-quality dataset called LAION-A, which contains 1 billion (image, text) pairs. The dataset is an aesthetic subset of the English part of the LAION-5B dataset, and the data was filtered with custom filters to remove inappropriate content. The model is initially released under a research license, and the developers welcome feedback to improve its performance and scalability. The model can be used in various domains such as art, design, storytelling, virtual reality, and accessibility. The creators pose several research questions related to the technical, academic, and ethical aspects of the model. The model weights are available from DeepFloyd’s Hugging Face space, and the model card and code are available on GitHub. A Gradio demo is provided for everyone, and the creators invite people to join the public discussions.
Niharika is a technical consulting intern at Marktechpost. She is a third year student, currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a very enthusiastic individual with a strong interest in machine learning, data science, and artificial intelligence and an avid reader of the latest developments in these fields.