A new era of photorealistic image synthesis has begun thanks to generative text-to-image (T2I) models such as DALL-E 2, Imagen, and Stable Diffusion. These models have driven many downstream applications, including image editing, video generation, and 3D asset creation. However, they demand enormous compute to train. Training SDv1.5, for example, takes about 6,000 A100 GPU days and costs around $320,000. The larger, more recent RAPHAEL requires roughly 60,000 A100 GPU days, costing around $3,080,000. Training also produces substantial CO2 emissions that burden the environment: training RAPHAEL emits about 35 tons of CO2, equivalent to what one person emits over seven years, as shown in Figure 1.
Figure 1: Comparison of CO2 emissions and training costs across T2I generators. PIXART-α's training costs a remarkable $26,000; its CO2 emissions and training cost are only 1.1% and 0.85% of RAPHAEL's, respectively.
Such a high price places significant restrictions on access to these models for both the research community and companies, and it impedes critical progress for the AIGC community. A crucial question arises from these difficulties: can a high-quality image generator be trained with manageable resource consumption? Researchers from Huawei Noah's Ark Lab, Dalian University of Technology, HKU, and HKUST introduce PIXART-α, which dramatically reduces training compute requirements while maintaining image quality competitive with state-of-the-art generators. They propose three core designs. The first is decomposition of the training strategy: they split the challenging text-to-image generation problem into three simpler subtasks:
- Learning the pixel distribution of natural images.
- Learning text-image alignment.
- Improving the aesthetic quality of images.
For the first subtask, they drastically reduce the learning cost by initializing the T2I model from a low-cost class-condition model. For the second and third subtasks, they adopt a pre-training and fine-tuning paradigm: pre-training on text-image pair data with high information density, followed by fine-tuning on data of higher aesthetic quality, which improves training efficiency. The second design is an efficient T2I Transformer: they inject text conditions through cross-attention modules and streamline the computationally demanding class-condition branch to improve the efficiency of the Diffusion Transformer (DiT). They also present a reparameterization technique that allows the modified text-to-image model to directly load the parameters of the original class-condition model.
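To give a rough sense of how cross-attention text conditioning and weight reuse from a class-condition checkpoint can fit together, here is a minimal PyTorch sketch. The dimensions, layer layout, and the shape-matching weight loader are assumptions for illustration only, not the official PIXART-α implementation (whose reparameterization is more involved).

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Illustrative DiT-style block: self-attention over image tokens plus a
    cross-attention layer that injects text-encoder embeddings (hypothetical
    sizes, not the official PIXART-alpha architecture)."""

    def __init__(self, dim: int = 1152, n_heads: int = 16, text_dim: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Image tokens attend to text tokens: the text condition enters here.
        self.cross_attn = nn.MultiheadAttention(
            dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, N_image_tokens, dim); text_tokens: (B, N_text_tokens, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x

def init_from_class_condition_ckpt(t2i_model: nn.Module, ckpt_path: str) -> None:
    """Sketch of reusing a class-condition DiT checkpoint: copy every weight
    whose name and shape match, so the new cross-attention layers start fresh.
    (The paper's reparameterization is more specific; this is a stand-in.)"""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = t2i_model.state_dict()
    compatible = {k: v for k, v in pretrained.items()
                  if k in own and own[k].shape == v.shape}
    own.update(compatible)
    t2i_model.load_state_dict(own)
```

A full model would stack many such blocks and add timestep conditioning, which is where the streamlined class-condition branch comes into play.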
In this way, they can exploit ImageNet's prior knowledge of natural image distributions to give the T2I Transformer a reasonable initialization and speed up its training. The third design is highly informative data. Their analysis reveals significant flaws in existing text-image pair datasets, LAION being one example: captions often suffer from a severe long-tail effect (many nouns appear at extremely low frequency) and lack informational content (they typically describe only a portion of the objects in an image). These flaws greatly reduce T2I training efficiency and require millions of iterations to obtain reliable text-image alignment. To overcome this, they propose an automatic labeling pipeline that uses a state-of-the-art vision-language model to generate captions for images in the SAM dataset.
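A minimal sketch of such an auto-labeling loop is shown below. The BLIP captioner, file layout, and output format are all assumptions for illustration; the authors use a stronger vision-language model and their own prompt and data format.

```python
import json
from pathlib import Path
from transformers import pipeline

# Hypothetical auto-captioning loop: a vision-language captioner produces
# dense captions for a folder of SAM images, yielding high-information
# text-image pairs for pre-training. BLIP is a stand-in model here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

def build_text_image_pairs(image_dir: str, out_file: str) -> None:
    records = []
    for img_path in sorted(Path(image_dir).glob("*.jpg")):
        caption = captioner(str(img_path))[0]["generated_text"]
        records.append({"image": img_path.name, "caption": caption})
    Path(out_file).write_text(json.dumps(records, indent=2))

# Example with assumed paths:
# build_text_image_pairs("sam_images/", "sam_captions.json")
```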
The SAM dataset has the advantage of containing a large and diverse collection of objects, making it an ideal source of text-image pairs with high information density, well suited for text-image alignment learning. These designs make the model's training highly efficient, requiring only 675 A100 GPU days and $26,000. As Figure 1 shows, their approach uses less training data (0.2% vs. Imagen) and less training time (2% vs. RAPHAEL). Their training cost is roughly 1% of RAPHAEL's, a saving of about $3,000,000 ($26,000 vs. $3,080,000).
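As a quick sanity check on these headline ratios, using only the figures quoted in this article:

```python
# Back-of-the-envelope check of the figures quoted above (article values).
pixart_cost, raphael_cost = 26_000, 3_080_000   # USD
pixart_days, raphael_days = 675, 60_000         # A100 GPU days

print(f"training cost vs. RAPHAEL: {pixart_cost / raphael_cost:.2%}")  # ~0.84%
print(f"GPU days vs. RAPHAEL:      {pixart_days / raphael_days:.2%}")  # ~1.13%
print(f"cost saving:               ${raphael_cost - pixart_cost:,}")   # $3,054,000
```

These ratios line up with the 0.85% and 1.1% figures cited in the Figure 1 caption.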
In terms of generation quality, their user studies show that PIXART-α offers better image quality and semantic alignment than current SOTA T2I models such as Stable Diffusion, and its performance on T2I-CompBench demonstrates its advantage in semantic control. They hope their effort to train T2I models efficiently will provide the AIGC community with useful insights and help more academics and independent companies build their own high-quality T2I models at more affordable cost.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.