The emergence of large language models (LLMs) has inspired a range of applications, including chatbots such as ChatGPT, email assistants, and coding tools. Considerable work has gone into making these models efficient enough for large-scale deployment, which is part of what enables ChatGPT to serve over 100 million weekly active users. Text generation, however, represents only a fraction of what these models can do.
Text-to-image (TTI) and text-to-video (TTV) models have distinct characteristics, so these rapidly evolving tasks benefit from different optimizations. A thorough examination is therefore needed to identify where TTI/TTV workloads can be improved. Despite notable algorithmic advances in image and video generation in recent years, comparatively little effort has gone into optimizing these models from a systems perspective.
Researchers from Harvard University and Meta take a quantitative approach to outline the current landscape of text-to-image (TTI) and text-to-video (TTV) models by examining several design dimensions, including latency and computational intensity. To achieve this, they create a suite comprising eight representative tasks for text-to-image and video generation, comparing them to widely used language models such as LLaMA.
They find notable distinctions: new system bottlenecks emerge even with state-of-the-art optimizations like Flash Attention. For example, convolution accounts for up to 44% of execution time in diffusion-based TTI models, while linear layers consume up to 49% of execution time in transformer-based TTI models.
Furthermore, they find that the temporal-attention bottleneck grows exponentially as the number of frames increases, highlighting the need for future system optimizations to address this challenge. They also develop an analytical framework to model how memory and FLOP requirements change through the stages of a diffusion model.
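The paper's analytical framework is not reproduced here; as a rough illustration of the idea, the sketch below estimates per-layer FLOPs and activation memory for the convolution and self-attention blocks of a UNet-style diffusion model as latent resolution scales. All layer shapes (channel width, kernel size, fp16 activations) are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: back-of-the-envelope FLOP/memory estimates for UNet-style
# layers. All shapes below are illustrative assumptions, not the paper's model.

def conv2d_flops(h, w, c_in, c_out, k=3):
    """FLOPs for a stride-1 convolution: 2 * H * W * Cin * Cout * k^2
    (the factor 2 counts multiply and add separately)."""
    return 2 * h * w * c_in * c_out * k * k

def self_attention_flops(n, d):
    """FLOPs for one self-attention block over N tokens of width d:
    Q/K/V/output projections (8 * N * d^2) plus the QK^T and attn*V
    matmuls (4 * N^2 * d). Flash Attention reduces memory traffic,
    not this FLOP count."""
    return 8 * n * d * d + 4 * n * n * d

def activation_bytes(n, d, bytes_per_elem=2):
    """fp16 activation footprint of one N x d feature map."""
    return n * d * bytes_per_elem

# Sweep latent resolutions; d=320 is an assumed channel width.
for side in (32, 64, 96, 128):
    n = side * side  # sequence length = number of spatial positions
    d = 320
    print(f"latent {side}x{side}: "
          f"conv ~{conv2d_flops(side, side, d, d) / 1e9:.1f} GFLOPs, "
          f"attn ~{self_attention_flops(n, d) / 1e9:.1f} GFLOPs, "
          f"acts ~{activation_bytes(n, d) / 1e6:.1f} MB")
```

Because the attention term carries an N² factor while convolution is linear in H·W, sweeping the resolution makes the two components' different scaling behavior visible at a glance.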
Large language models (LLMs) process input as a sequence of tokens, and the sequence length bounds how much context the model can consider when predicting the next token. In state-of-the-art text-to-image (TTI) and text-to-video (TTV) models, by contrast, the sequence length is determined by the size of the image being processed.
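In a latent-diffusion model like Stable Diffusion, the attention "sequence length" at each UNet stage is simply the number of spatial positions in that stage's feature map, so it grows quadratically with image side length. The 8x VAE downsampling factor and four-stage layout below are common Stable Diffusion defaults, used here as assumptions:

```python
# Sequence length in a latent-diffusion UNet is the number of spatial
# positions at each stage, so it is set by image size rather than by a
# token budget. Downsample factor and stage count are assumed SD defaults.

def unet_sequence_lengths(image_side, vae_downsample=8, n_stages=4):
    """Sequence length (H * W of the feature map) at each UNet stage,
    halving the spatial side at every stage."""
    latent = image_side // vae_downsample
    return [(latent // (2 ** s)) ** 2 for s in range(n_stages)]

print(unet_sequence_lengths(512))   # 64x64 latent -> [4096, 1024, 256, 64]
print(unet_sequence_lengths(1024))  # doubling the side quadruples each length
```

Doubling the image side from 512 to 1024 quadruples the sequence length at every stage, which is why image resolution, rather than a fixed context window, drives attention cost in these models.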
To understand the impact of scaling image size more concretely, they conduct a case study on the Stable Diffusion model and characterize the sequence-length distribution during Stable Diffusion inference. They find that after applying techniques such as Flash Attention, convolution exhibits a stronger dependence on image size than attention.
Review the paper. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Master's degree in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at the fundamental level leads to new discoveries that advance technology, and he is passionate about understanding nature with the help of tools such as mathematical models, machine learning models, and artificial intelligence.