The rapid scaling of diffusion models has created challenges in memory usage and latency, making them difficult to deploy, particularly in resource-constrained environments. These models can generate high-fidelity images but are demanding in both memory and computation, which limits their availability on consumer devices and in applications that require low latency. Addressing these challenges is essential to running large-scale diffusion models in real time across a wide range of platforms.
Current techniques for addressing the memory and speed problems of diffusion models include post-training quantization and quantization-aware training, primarily through weight-only methods such as NormalFloat4 (NF4). While these methods work well for language models, they are insufficient for diffusion models, which are far more compute-bound. Unlike language models, diffusion models require low-bit quantization of both weights and activations to achieve real speedups without performance degradation. At 4-bit precision, however, existing quantization methods are undermined by outliers in both weights and activations, which compromise visual quality and computational efficiency, motivating a more robust solution.
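To see why joint weight-and-activation quantization is so sensitive to outliers, consider a minimal symmetric 4-bit quantizer. This is an illustrative PyTorch sketch, not any method from the paper; the function names and the toy outlier are assumptions for demonstration:

```python
import torch

def quantize_sym_4bit(t: torch.Tensor):
    # Symmetric per-tensor 4-bit quantization: integers land in [-8, 7].
    scale = t.abs().max() / 7.0              # the largest magnitude sets the scale
    q = torch.clamp(torch.round(t / scale), -8, 7)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q * scale

# Weight-only methods (e.g., NF4) keep activations in 16-bit, so the matmul
# still runs at 16-bit speed; quantizing BOTH operands to 4 bits unlocks
# low-bit kernels, but then a single outlier dominates the dynamic range
# and crushes the resolution available to every other value.
x = torch.randn(64, 64)
x[0, 0] = 50.0                               # one activation outlier
q, s = quantize_sym_4bit(x)
print((dequantize(q, s) - x).abs().mean().item())  # large reconstruction error
```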
Researchers from MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, and Pika Labs propose SVDQuant, a quantization paradigm that introduces a low-rank branch to absorb outliers, enabling effective 4-bit quantization of diffusion models. SVDQuant first migrates outliers from the activations to the weights via smoothing, then applies singular value decomposition (SVD) to absorb the weight outliers into a 16-bit low-rank branch, allowing the residual to be quantized to 4 bits without the quality loss that outliers typically cause. To keep the extra branch from negating the speedup, the researchers also developed an inference engine called Nunchaku, which fuses the low-rank and low-bit compute kernels and optimizes memory access to reduce latency.
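The core decomposition can be sketched as W ≈ L1·L2 + R, where the rank-r factors L1 and L2 stay in 16-bit and the residual R is quantized to 4 bits. Below is a simplified illustration of that idea; the function names, the rank of 32, and the naive per-tensor quantizer are my assumptions, not the paper's implementation:

```python
import torch

def svdquant_decompose(W: torch.Tensor, rank: int = 32):
    # Split W into a 16-bit low-rank branch plus a 4-bit residual. The top
    # singular directions soak up most of the outlier energy, so the residual
    # R = W - L1 @ L2 has a much smaller dynamic range and quantizes cleanly.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]          # (out, rank), kept in 16-bit
    L2 = Vh[:rank, :]                    # (rank, in),  kept in 16-bit
    R = W - L1 @ L2                      # residual to be quantized
    scale = R.abs().max() / 7.0
    Rq = torch.clamp(torch.round(R / scale), -8, 7)
    return L1, L2, Rq, scale

def forward(x, L1, L2, Rq, scale):
    # High-precision low-rank branch + dequantized 4-bit residual branch.
    return (x @ L2.T) @ L1.T + x @ (Rq * scale).T

W = torch.randn(512, 512)
L1, L2, Rq, scale = svdquant_decompose(W)
x = torch.randn(4, 512)
print((forward(x, L1, L2, Rq, scale) - x @ W.T).abs().mean().item())
```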
SVDQuant works in two stages. First, smoothing migrates outliers from the activations to the weights. Then, SVD splits the adjusted weights into a low-rank component and a residual: the low-rank component absorbs the outliers at 16-bit precision, while the residual is quantized to 4 bits. The Nunchaku inference engine optimizes this further by fusing the low-rank and low-bit branches, merging their input and output dependencies so intermediate results avoid extra round trips to memory, which reduces latency. Evaluations on models such as FLUX.1 and SDXL, using datasets such as MJHQ and sDCI, show memory savings of 3.5x and latency reductions of up to 10.1x on laptop-class devices. For example, SVDQuant shrinks the 12-billion-parameter FLUX.1 model from 22.7 GB to 6.5 GB, avoiding CPU offloading in memory-limited configurations.
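The smoothing step can be illustrated with a SmoothQuant-style per-channel rescaling, in which activation outliers are divided out and folded into the weights, where the low-rank branch can then absorb them. This sketch, including the alpha hyperparameter, is an assumed simplification rather than the paper's exact formulation:

```python
import torch

def smooth(X: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
    # Per-input-channel scaling: dividing X by s and multiplying W by s
    # leaves the product unchanged, but moves outliers to the weight side.
    act_max = X.abs().amax(dim=0)        # per-channel activation range
    w_max = W.abs().amax(dim=0)          # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    return X / s, W * s                  # (X / s) @ (W * s).T == X @ W.T

X = torch.randn(8, 512)
X[:, 0] *= 50                            # channel 0 carries activation outliers
W = torch.randn(256, 512)
Xs, Ws = smooth(X, W)
print((Xs @ Ws.T - X @ W.T).abs().max().item())  # ~0: output is preserved
```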
SVDQuant outperforms state-of-the-art quantization methods in both efficiency and visual fidelity. For 4-bit quantization, it consistently preserves perceptual similarity and image quality across imaging tasks, beating competitors such as NF4 on Fréchet Inception Distance (FID), ImageReward, LPIPS, and PSNR across multiple diffusion model architectures. For example, on the FLUX.1-dev model, the SVDQuant configuration achieves LPIPS scores closely aligned with the 16-bit baseline while cutting model size by 3.5x and delivering roughly 10.1x speedup on GPU devices without CPU offloading. This efficiency enables real-time generation of high-quality images on memory-limited devices and makes practical deployment of large diffusion models feasible.
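Of the reported metrics, PSNR is the simplest to state: it measures, in decibels, how closely the quantized model's output pixels track the 16-bit baseline. A minimal reference implementation, assuming images normalized to [0, 1]:

```python
import torch

def psnr(img_a: torch.Tensor, img_b: torch.Tensor, max_val: float = 1.0):
    # Peak signal-to-noise ratio in dB; higher means the 4-bit output
    # is closer to the 16-bit reference image.
    mse = torch.mean((img_a - img_b) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```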
In conclusion, SVDQuant delivers effective 4-bit quantization for diffusion models, addressing their outlier problem while maintaining image quality and achieving significant reductions in memory and latency. Combined with the Nunchaku inference engine, which eliminates redundant data movement, it lays the groundwork for efficiently deploying large diffusion models and broadens their potential use in real-world, interactive applications on consumer hardware.
Check out the Paper. All credit for this research goes to the researchers of this project.
Aswin AK is a consulting intern at MarkTechPost. He is pursuing his dual degree at the Indian Institute of Technology Kharagpur. He is passionate about data science and machine learning, and brings a strong academic background and practical experience in solving real-life interdisciplinary challenges.