Model sizes and inference workloads have grown dramatically as large diffusion models for image generation have become more common. Optimizing the performance of on-device ML inference in mobile environments is a delicate balancing act due to resource constraints. Running latent diffusion model (LDM) inference on-device poses even greater hurdles given the considerable memory requirements and computational demands of these models, especially in light of the need for cost effectiveness and user privacy.
The rapid development and widespread adoption of foundation models have transformed artificial intelligence. Diffusion models in particular have garnered much attention for their versatility and ability to produce photorealistic images. Deploying these models locally on the user's device brings benefits such as reduced server costs, offline capability, and improved user privacy. However, typical diffusion models have more than a billion parameters, which poses challenges given the limited computational and memory resources of mobile devices. Google researchers present a series of optimizations to a widely used diffusion model implementation that achieve the fastest reported inference latency to date on GPU-equipped mobile devices. These updates broaden the reach of generative AI and improve the overall user experience across a wide range of devices.
On-device model inference acceleration has attracted considerable interest recently due to its many benefits over server-based approaches, such as lower latency, improved privacy, and better scalability. The computational cost of the softmax operation, used pervasively in deep learning, has motivated optimization efforts that produced several distinct speedup strategies. Winograd convolution was developed to improve the efficiency of convolutional computation by minimizing the number of multiplications required, which is especially beneficial on graphics processing units (GPUs).
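To see where those savings come from, here is a minimal sketch of the 1-D Winograd transform F(2,3), which produces two convolution outputs with four multiplications instead of the naive six. This is illustrative only; production kernels apply the 2-D analogue F(2x2, 3x3) to image tiles:

```python
# Winograd F(2,3): two outputs of a 3-tap convolution with 4 multiplies.
import numpy as np

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 output samples."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -0.25])
# Matches the direct sliding-window computation.
assert np.allclose(winograd_f23(d, g), np.correlate(d, g, mode="valid"))
```

The filter-side transforms (the sums over `g`) can be precomputed once per filter, so at runtime only four multiplications per pair of outputs remain.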
The widespread success and adoption of the Transformer architecture has also sparked research into speeding up the attention mechanism. Reformer uses a sparse approximation to reduce computational cost, while other works rely on low-rank approximations or a combination of techniques. FlashAttention, by contrast, is an exact attention algorithm that takes hardware configuration into account to achieve better performance.
The main focus is the challenge of generating images from textual descriptions using large diffusion models. Although the discussion centers on how the proposed optimizations apply to the Stable Diffusion architecture, it is worth noting that these optimizations transfer readily to other large diffusion models. Text-to-image generation requires additional conditioning on the desired textual description to guide the reverse diffusion process.
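As a rough illustration of what that conditioning looks like, the sketch below runs a schematic text-conditioned reverse diffusion loop with classifier-free guidance. The `text_encoder` and `unet` here are random stand-ins, not the actual Stable Diffusion components, and the update rule is deliberately simplified:

```python
# Schematic text-conditioned reverse diffusion (toy stand-ins throughout).
import torch

text_encoder = lambda prompt: torch.randn(1, 77, 768)   # stand-in embedding
unet = lambda x, t, ctx: torch.randn_like(x)            # stand-in denoiser

latents = torch.randn(1, 4, 64, 64)        # start from pure Gaussian noise
cond, uncond = text_encoder("a photo of a cat"), text_encoder("")
guidance = 7.5                             # classifier-free guidance scale

for t in reversed(range(0, 1000, 50)):     # 20 denoising timesteps
    eps_cond = unet(latents, t, cond)      # noise prediction given the text
    eps_uncond = unet(latents, t, uncond)  # unconditional noise prediction
    # Guidance pushes each update toward the text-conditioned direction.
    eps = eps_uncond + guidance * (eps_cond - eps_uncond)
    latents = latents - 0.05 * eps         # schematic denoising step
# A real LDM would now decode `latents` to pixels with its VAE decoder.
```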
The attention blocks used extensively by the denoising model in the LDM present a major opportunity for improvement. These blocks let the model zero in on relevant information by assigning weights to different parts of the input. The attention modules can be optimized in several ways; the researchers apply only one of the two optimizations detailed below, whichever produces the better result.
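For reference, the computation inside such an attention block is the standard scaled dot-product attention, shown here in plain NumPy; both optimizations below target exactly this softmax-plus-matmul pattern:

```python
# Scaled dot-product attention: softmax(Q @ K^T / sqrt(d)) @ V.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (N, N) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of the values

N, d = 64, 32
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = attention(Q, K, V)                       # shape (64, 32)
```

The intermediate (N, N) score matrix is what makes this block expensive in memory traffic, which is precisely what the two optimizations attack.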
The first optimization, called partially fused softmax, reduces the amount of memory read and written during the attention module's softmax by fusing it with the preceding matrix multiplication. The second uses FlashAttention, an I/O-aware exact attention algorithm. This approach reduces the number of high-bandwidth memory accesses on the GPU, making it an excellent choice for memory-bandwidth-limited workloads. However, it is register intensive, and the team found that it works only with certain SRAM sizes, so they apply it only on a subset of GPUs and for attention matrices of particular dimensions.
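The following toy sketch illustrates the core FlashAttention trick for a single query row: the keys and values are processed in tiles while a running maximum and running sum keep the softmax exact, so the full N x N attention matrix never has to be written to memory. This is an illustration of the idea, not the paper's GPU kernel:

```python
# Single-row, tiled attention with an online (streaming) softmax.
import numpy as np

def flash_attention_row(q, K, V, tile=16):
    m = -np.inf                   # running max of scores seen so far
    s = 0.0                       # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])   # running, un-normalized output
    for i in range(0, K.shape[0], tile):
        scores = K[i:i+tile] @ q / np.sqrt(q.shape[0])
        m_new = max(m, scores.max())
        corr = np.exp(m - m_new)  # rescale past partial results to new max
        p = np.exp(scores - m_new)
        s = s * corr + p.sum()
        acc = acc * corr + p @ V[i:i+tile]
        m = m_new
    return acc / s                # exact softmax attention, tile by tile

N, d = 128, 32
q, K, V = np.random.randn(d), np.random.randn(N, d), np.random.randn(N, d)
scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
assert np.allclose(flash_attention_row(q, K, V), (w / w.sum()) @ V)
```

The running statistics `m`, `s`, and `acc` are what consume registers and fast on-chip SRAM, which is why the technique only pays off for certain tile and matrix sizes.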
In addition, the team found that the fusion windows for layers and units commonly used in LDMs need to be substantially larger than what commercially available GPU-accelerated ML inference engines currently provide. Given the limitations of standard fusion rules, they devised custom implementations capable of executing a wider range of neural operators. Their attention focused on two in particular: the Gaussian Error Linear Unit (GELU) and the Group Normalization layer.
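As an example of why such fusion pays off, here is the tanh approximation of GELU written as the single elementwise expression a fused kernel would evaluate. NumPy is used only to show the math; an actual implementation would compile this into one GPU shader pass rather than a chain of separate tensor ops, each of which would otherwise read and write the whole tensor through memory:

```python
# GELU (tanh approximation) as one fused elementwise expression.
import numpy as np

def gelu_fused(x):
    # Unfused, this is roughly seven primitive ops (pow, mul, add, tanh, ...);
    # fused, every intermediate stays on-chip.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3.0, 3.0, 7)
print(gelu_fused(x))
```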
Model file size, large runtime memory requirements, and long inference latency have proven to be the major obstacles to on-device ML inference with large models. The researchers found that memory bandwidth usage was the dominant bottleneck, so they focused on improving memory bandwidth utilization while maintaining a healthy ALU-to-memory-access ratio. Together, the optimizations they demonstrated enable large diffusion models to run on a wide range of devices with unprecedented latency. These improvements extend the applicability of the models and improve the user experience across a broad set of devices.
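A back-of-the-envelope roofline calculation shows why that ratio matters. The device numbers below are assumed round figures for a mobile GPU, not measurements from the paper:

```python
# Roofline check: an op is memory-bound when its arithmetic intensity
# (FLOPs per byte moved) falls below the device's compute/bandwidth ratio.
peak_flops = 2e12              # 2 TFLOP/s mobile GPU (assumed)
peak_bw = 50e9                 # 50 GB/s memory bandwidth (assumed)
ridge = peak_flops / peak_bw   # 40 FLOPs/byte: below this, bandwidth-bound

# Elementwise GELU on fp16: ~8 FLOPs per element, 4 bytes moved (read+write).
gelu_intensity = 8 / 4         # 2 FLOPs/byte, far below the ridge
print(gelu_intensity < ridge)  # True: fusing such ops saves real time
```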
Check out the Paper and Google AI Article for the full details.
Dhanshree Shenwai is a computer engineer with solid experience in FinTech companies covering the finance, cards & payments, and banking domains, and a strong interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's changing world.