The most commonly used metric to describe AI performance is TOPS (Tera Operations Per Second), which indicates computing power but oversimplifies the complexity of AI systems. When designing a system for a real AI use case, many factors beyond TOPS must also be considered, including memory/cache size and bandwidth, data types, energy efficiency, etc.
Furthermore, each AI use case has its own characteristics and requires a holistic examination of the entire processing pipeline. This examination delves into its impact on system components and explores optimization techniques to predict the best achievable pipeline performance.
In this post, we choose an AI use case, an end-to-end real-time infinite zoom feature built on the Stable Diffusion v2 inpainting model, and study how to build a corresponding AI system with the best performance/watt. This can serve as a proposal, drawing both on well-established technologies and on new research ideas that can lead to possible architectural features.
Background on end-to-end video zoom
- As shown in the diagram below, to zoom out the video frames (the fish image), we resize the frames and apply a border mask before feeding them into the Stable Diffusion inpainting pipeline. Together with an input text prompt, the pipeline generates frames with new content that fills the masked border region. Applying this process to successive frames produces the continuous zoom-out effect. To conserve computing power, we can sample video frames sparsely and avoid inpainting every frame (for example, inpaint 1 frame out of every 5) as long as the result still provides a satisfactory user experience. A minimal code sketch of this loop follows the list below.
- The Stable Diffusion v2 inpainting pipeline is pre-trained on the Stable Diffusion 2 model, a text-to-image latent diffusion model created by Stability AI and LAION. The blue boxes in the diagram below show each function block in the inpainting pipeline.
- The Stable Diffusion 2 model generates images with a resolution of 768*768 and is trained to remove random noise iteratively (50 steps) to obtain a new image. The denoising process is implemented by the UNet and the scheduler; it is very slow and requires a lot of computation and memory.
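To make the zoom-out loop concrete, here is a minimal sketch using the Hugging Face diffusers inpainting pipeline. The frame size, border width, prompt, number of keyframes, and mask construction are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of the infinite-zoom loop (assumptions: 512*512 frames,
# a 64-pixel border, a fixed prompt, and the diffusers inpainting pipeline).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def zoom_out_step(frame: Image.Image, prompt: str, border: int = 64) -> Image.Image:
    """Shrink the frame, paste it in the center, and inpaint the border ring."""
    w, h = frame.size
    shrunk = frame.resize((w - 2 * border, h - 2 * border))
    canvas = Image.new("RGB", (w, h))
    canvas.paste(shrunk, (border, border))
    mask = Image.new("L", (w, h), 255)                 # white = region to generate
    mask.paste(Image.new("L", shrunk.size, 0), (border, border))
    return pipe(prompt=prompt, image=canvas, mask_image=mask,
                num_inference_steps=50).images[0]

frame = Image.open("fish.png").resize((512, 512))
keyframes = [frame]
for _ in range(30):                                    # one inpainted keyframe per output step
    frame = zoom_out_step(frame, "underwater coral reef, photorealistic")
    keyframes.append(frame)
```

With sparse sampling, only the keyframes go through the expensive inpainting call; the intermediate output frames can be produced by simple rescaling or interpolation between keyframes, which is where the compute saving comes from.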
There are 4 models used in the pipeline as shown below:
- VAE (image encoder). Converts the image into a low-dimensional latent representation (64*64)
- CLIP (text encoder). Transformer architecture (77*768), 85M parameters
- UNet (diffusion process). Iteratively denoises the latent representation under the control of a scheduler algorithm, 865M parameters
- VAE (image decoder). Transforms the latent representation back into an image (512*512)
Most of the Stable Diffusion operations (98% of the autoencoder and text encoder models and 84% of the UNet) are convolutions. Most of the remaining UNet operations (16%) are dense matrix multiplications due to the self-attention blocks. These models can be quite large (they vary with different hyperparameters) and require a lot of memory; for mobile devices with limited memory, it is essential to explore model compression techniques to reduce model size, including quantization (2-4x model size reduction and 2-3x speedup going from FP16 to INT4), pruning, sparsity, etc.
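As a rough illustration of where the quantization saving comes from, here is a small numpy sketch of symmetric per-tensor FP16-to-INT4 quantization of one weight tensor. Per-channel scales, calibration, and the actual on-device kernels are out of scope, and the tensor shape is made up.

```python
# Illustrative FP16 -> INT4 symmetric quantization of a single weight tensor.
# Storage drops from 2 bytes to 0.5 bytes per weight (4x), plus one scale.
import numpy as np

def quantize_int4(w_fp16: np.ndarray):
    scale = np.abs(w_fp16).max() / 7.0                   # map max |w| to the INT4 limit 7
    q = np.clip(np.round(w_fp16 / scale), -8, 7).astype(np.int8)
    return q, scale                                      # q only uses 4 significant bits

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)

w = np.random.randn(320, 320).astype(np.float16)         # toy UNet-like weight block
q, scale = quantize_int4(w)
print("fp16 bytes:", w.size * 2, "-> int4 bytes:", w.size // 2)
print("max abs quantization error:", float(np.abs(dequantize(q, scale) - w).max()))
```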
Optimizing energy efficiency for AI features such as end-to-end video zoom
For AI features like video zoom, power efficiency is one of the main factors for a successful implementation on mobile/edge devices. These battery-operated edge devices store their energy in a battery whose capacity is measured in watt-hours (Wh) or milliwatt-hours (mWh); for example, a 1200 Wh battery can deliver 1200 watts for one hour before discharging, so an application that consumes 2 watts can be powered for 600 hours. Energy efficiency is calculated as IPS/Watt, where IPS is inferences per second (FPS/Watt for image-based applications, or TOPS/Watt).
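The arithmetic behind these figures is simple; a tiny sketch using the illustrative numbers above (the FPS and power values are hypothetical, not measurements):

```python
# Battery life and energy efficiency with illustrative numbers.
battery_wh = 1200.0            # battery capacity in watt-hours (from the example above)
app_power_w = 2.0              # average application power draw in watts
print("battery life (hours):", battery_wh / app_power_w)       # 600 h

fps = 30.0                     # hypothetical throughput of an image pipeline
power_w = 1.5                  # hypothetical power while the model is running
print("efficiency (FPS/Watt):", fps / power_w)                  # 20 FPS/W
```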
Reducing power consumption is essential to achieving longer battery life on mobile devices. Many factors contribute to high power usage, including large numbers of memory transactions caused by the large model size, intensive matrix-multiplication computation, etc. Let's take a look at how to optimize the use case for efficient use of energy.
1. Model optimization.
Beyond quantization, pruning, and sparsity, there is also weight sharing. A network contains many redundant weights while only a small number of weights are useful; the number of stored weights can be reduced by letting multiple connections share the same weight value, as shown below. For a 4*4 FP32 weight matrix, the original 16*32 = 512 bits are reduced to 4 shared FP32 weights plus a 4*4 matrix of 2-bit indices, i.e. 4*32 + 16*2 = 160 bits in total.
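A minimal sketch of weight sharing, assuming k-means clustering into 4 shared values (scikit-learn is used purely for illustration); it reproduces the 512-bit to 160-bit arithmetic above:

```python
# Weight sharing on a 4*4 FP32 matrix: 4 shared values + 2-bit indices.
# Storage: before 16*32 = 512 bits; after 4*32 + 16*2 = 160 bits.
import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(4, 4).astype(np.float32)
km = KMeans(n_clusters=4, n_init=10).fit(w.reshape(-1, 1))
codebook = km.cluster_centers_.ravel().astype(np.float32)   # 4 shared weights
indices = km.labels_.reshape(4, 4)                          # 2-bit index per weight

w_approx = codebook[indices]                                # reconstructed weights
bits_before = w.size * 32
bits_after = codebook.size * 32 + indices.size * 2
print(bits_before, "bits ->", bits_after, "bits")           # 512 -> 160
print("max abs error:", float(np.abs(w - w_approx).max()))
```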
2. Memory optimization.
Memory is a critical component and consumes more power than matrix multiplication does: the energy cost of a DRAM access can be orders of magnitude higher than that of a multiply operation. On mobile devices, accommodating large models within on-chip local memory is often a challenge, which leads to numerous memory transactions between local memory and external DRAM and, in turn, to higher latency and energy consumption.
Optimizing off-chip memory access is therefore crucial for improving power efficiency. The paper "Optimizing off-chip memory access for a deep neural network accelerator" (4) introduced an adaptive scheduling algorithm designed to minimize DRAM access, demonstrating substantial reductions in power consumption and latency, ranging from 34% to 93%.
A novel method (5) is proposed to minimize memory access and save energy. The core idea is to choose the right block (tile) size for partitioning each CNN layer so that it matches the DRAM/SRAM resources and maximizes data reuse, and to optimize the tile access schedule to minimize the number of DRAM accesses. The data-to-DRAM mapping focuses on mapping a tile of data to different columns of the same row to maximize row-buffer hits. For larger data tiles, the same bank in different chips can be used to achieve chip-level parallelism. Additionally, if the same row is already populated in all chips, data is mapped to different banks in the same chip to achieve bank-level parallelism. For SRAM, a similar concept of bank-level parallelism can be applied. The proposed optimization flow saves 12% of energy for AlexNet, 36% for VGG-16, and 46% for MobileNet. Below is a high-level flowchart of the proposed method and a schematic illustration of the DRAM data mapping.
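As a toy model of the tiling idea (not the scheduling algorithm from (5)), the sketch below picks a tile size so that the working set of a matrix multiplication fits in an assumed SRAM budget, so every DRAM-resident tile is fetched once and then reused on chip:

```python
# Toy loop tiling for data reuse: choose the largest tile such that one tile of
# each operand and the output fits in a hypothetical on-chip SRAM budget.
import numpy as np

SRAM_BYTES = 256 * 1024           # assumed on-chip buffer capacity
ELEM_BYTES = 4                    # FP32 elements

def pick_tile(n: int) -> int:
    """Largest power-of-two tile T (dividing n) with 3 T*T tiles fitting in SRAM."""
    t = 1
    while 2 * t <= n and 3 * (2 * t) ** 2 * ELEM_BYTES <= SRAM_BYTES:
        t *= 2
    return t

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int) -> np.ndarray:
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m), dtype=np.float32)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each on-chip tile of A and B is reused across a whole
                # tile*tile output block before it is evicted.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.randn(512, 512).astype(np.float32)
b = np.random.randn(512, 512).astype(np.float32)
tile = pick_tile(512)                                       # -> 128 with these numbers
print("tile size:", tile)
print("max abs diff vs direct matmul:",
      float(np.abs(tiled_matmul(a, b, tile) - a @ b).max()))
```

The cited work goes further by also optimizing which DRAM rows, banks, and chips each tile maps to, which is where the additional row-buffer and parallelism gains come from.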
3. Dynamic power scaling.
The power of a system can be estimated as P = C*F*V², where C is the switching capacitance, F the operating frequency, and V the operating voltage. Techniques such as DVFS (Dynamic Voltage and Frequency Scaling) were developed to optimize power at runtime by scaling voltage and frequency according to the workload. In deep learning, per-layer DVFS is not appropriate because voltage scaling has a long latency; frequency scaling, on the other hand, is fast enough to keep up with each layer. A layer-wise dynamic frequency scaling (DFS) technique for NPUs (6) is proposed, with a power model that predicts power consumption to determine the highest allowed frequency. DFS is shown to improve latency by 33% and save 14% of energy.
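A toy sketch of how a layer-wise DFS policy could use the P = C*F*V² model: for each layer, pick the highest available frequency whose predicted power stays under a budget. The capacitance values, operating points, voltage, and the 2 W budget are hypothetical, not taken from (6).

```python
# Layer-wise DFS sketch using the P = C * F * V^2 power model. All numbers
# (effective capacitances, frequency points, voltage, power budget) are made up.
POWER_BUDGET_W = 2.0
FREQ_POINTS_HZ = [400e6, 600e6, 800e6, 1000e6]     # frequencies switchable per layer
VOLTAGE_V = 0.8                                    # held fixed (voltage scaling is too slow)

# Hypothetical per-layer effective switching capacitance, which a real system
# would obtain from a fitted power model.
LAYER_CAPACITANCE_F = {"conv_in": 2.0e-9, "self_attention": 3.5e-9, "conv_out": 1.5e-9}

def predicted_power(c_eff: float, freq_hz: float, volt: float) -> float:
    return c_eff * freq_hz * volt ** 2

def pick_frequency(c_eff: float) -> float:
    """Highest frequency whose predicted power fits the budget (fallback: lowest)."""
    feasible = [f for f in FREQ_POINTS_HZ
                if predicted_power(c_eff, f, VOLTAGE_V) <= POWER_BUDGET_W]
    return max(feasible) if feasible else min(FREQ_POINTS_HZ)

for layer, c_eff in LAYER_CAPACITANCE_F.items():
    f = pick_frequency(c_eff)
    print(f"{layer}: {f / 1e6:.0f} MHz, predicted "
          f"{predicted_power(c_eff, f, VOLTAGE_V):.2f} W")
```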
4. Low-power dedicated AI HW accelerator architecture. To accelerate deep learning inference, specialized AI accelerators have demonstrated superior energy efficiency, achieving similar performance with reduced power consumption. For example, Google's TPU is designed to speed up matrix multiplication by reusing input data for many calculations, unlike a CPU that fetches data for each calculation. This approach conserves energy and decreases data transfer latency.
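A back-of-the-envelope model of why operand reuse saves energy: for an N*N matrix multiplication, fetching both operands for every multiply costs 2*N³ memory reads, while a dataflow that loads each element once and reuses it N times needs only about 2*N². The counts below are a simplified model, not TPU measurements.

```python
# Simplified fetch-count model contrasting "fetch per MAC" with operand reuse.
N = 128                            # matrix dimension of an N x N matmul
macs = N ** 3                      # multiply-accumulate operations
naive_fetches = 2 * macs           # both operands fetched for every MAC
reuse_fetches = 2 * N * N          # each A and B element loaded once, reused N times
print("reuse factor:", naive_fetches // reuse_fetches)       # = N = 128
```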