Recent developments in Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities across a wide range of fields. LLMs can contain hundreds of billions of parameters and are trained on enormous text corpora.
Studies show that in LLM inference, memory bandwidth, not compute, is the key performance limitation for generative tasks. In other words, for memory-bound workloads the speed at which parameters can be loaded from and stored in memory, rather than arithmetic throughput, becomes the main latency bottleneck. However, progress in memory bandwidth technology has lagged far behind compute, giving rise to a phenomenon known as the memory wall.
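To make the memory-bound argument concrete, here is a rough back-of-envelope estimate (not from the paper; the bandwidth figure and model size are assumptions for illustration) showing why shrinking the weights directly shrinks the per-token latency floor:

```python
# Back-of-envelope sketch: in autoregressive decoding, every generated token
# must stream essentially all model weights from GPU memory, so memory
# bandwidth sets a lower bound on per-token latency.

def min_latency_per_token_ms(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency (ms) when decoding is memory-bound."""
    bytes_moved = n_params * bytes_per_param
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

# Example: a 7B-parameter model on a GPU with ~768 GB/s of memory bandwidth
# (roughly an A6000-class card; the exact figure is an assumption).
for bits in (16, 4, 3):
    ms = min_latency_per_token_ms(7e9, bits / 8, 768)
    print(f"{bits}-bit weights -> at least {ms:.1f} ms per token")
```

Under these assumed numbers, moving from 16-bit to 3- or 4-bit weights cuts the bandwidth-imposed latency floor by roughly the same factor, which is the motivation for low-bit quantization.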
Quantization is a promising method that stores model parameters at lower precision than the 16 or 32 bits typically used during training. Despite recent advances such as LLaMA and its instruction-following variants, it is still difficult to achieve good quantization performance, especially at lower bit precision and with relatively modest model sizes (e.g., 50B parameters).
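For context, the simple uniform (round-to-nearest) scheme that low-bit methods are typically compared against can be sketched as follows; this is a generic illustration of baseline quantization, not the SqueezeLLM method:

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 3):
    """Round-to-nearest uniform quantization: 2**bits evenly spaced levels
    spanning [w.min(), w.max()]. Returns integer codes plus scale/offset."""
    n_levels = 2 ** bits
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (n_levels - 1)
    codes = torch.round((w - w_min) / scale)        # integer codes in [0, n_levels - 1]
    return codes.to(torch.uint8), scale, w_min

def dequantize(codes: torch.Tensor, scale: torch.Tensor, w_min: torch.Tensor) -> torch.Tensor:
    return codes.float() * scale + w_min

# Toy example on a random weight matrix.
w = torch.randn(1024, 1024)
codes, scale, zero = uniform_quantize(w, bits=3)
w_hat = dequantize(codes, scale, zero)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

With only 8 levels (3 bits) spread uniformly over the full weight range, a handful of extreme values can stretch the grid and waste levels, which is exactly the failure mode the Berkeley work targets.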
A new UC Berkeley study digs deeper into low-bit-precision quantization to reveal the shortcomings of current methods. Based on these findings, the researchers present SqueezeLLM, a post-training quantization framework that combines a Dense-and-Sparse decomposition technique with a novel sensitivity-based non-uniform quantization strategy. These methods enable quantization at ultra-low bit precision while retaining competitive model performance, drastically reducing model size and inference-time cost. Their method reduces the perplexity of the 3-bit LLaMA-7B model on the C4 dataset from 28.26 with uniform quantization to 7.75, a considerable improvement.
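A rough sketch of what sensitivity-based non-uniform quantization can look like is shown below: the 2^bits quantization levels are placed by a weighted k-means over the weight values, with a per-weight sensitivity proxy (e.g., squared gradients) as sample weights, so levels concentrate where quantization error hurts most. The function name, the use of scikit-learn, and the random sensitivities are assumptions for illustration, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_weighted_quantize(w: np.ndarray, sens: np.ndarray, bits: int = 3):
    """Cluster weight values into 2**bits centroids with sensitivity-weighted
    k-means, producing a non-uniform lookup table plus per-weight codes."""
    values = w.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** bits, n_init=10, random_state=0)
    km.fit(values, sample_weight=sens.reshape(-1))
    codes = km.predict(values).astype(np.uint8)      # per-weight index into the table
    lut = km.cluster_centers_.reshape(-1)            # non-uniform quantization levels
    return codes.reshape(w.shape), lut

# Toy example: random weights and a stand-in for squared-gradient sensitivities.
w = np.random.randn(256, 256).astype(np.float32)
sens = (np.random.rand(256, 256) ** 2).astype(np.float32)
codes, lut = sensitivity_weighted_quantize(w, sens, bits=3)
w_hat = lut[codes]                                   # dequantized weights
print("mean abs error:", np.abs(w - w_hat).mean())
```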
Through extensive testing on the C4 and WikiText2 benchmarks, the researchers found that SqueezeLLM consistently outperforms existing quantization approaches by a wide margin across different bit precisions when applied to LLaMA-7B, 13B, and 30B for language modeling tasks.
According to the team, low-bit-precision quantization of many LLMs is particularly difficult due to substantial outliers in the weight matrices. These outliers also affect their non-uniform quantization approach, as they skew the bit allocation toward extremely high or low values. To handle the outliers, they propose a straightforward method that splits the model weights into dense and sparse components, as sketched below. Isolating the outliers narrows the range of the central region by up to a factor of 10, resulting in better quantization accuracy. The sparse component can be kept in full precision using efficient sparse storage formats such as Compressed Sparse Row (CSR). This incurs low overhead, since efficient sparse kernels handle the sparse part and its computation is parallelized with the dense part.
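The Dense-and-Sparse idea can be illustrated with a short sketch: pull the largest-magnitude weights out into a full-precision CSR matrix and leave the narrow-range remainder to be quantized. The 0.5% outlier fraction, function names, and the use of SciPy here are hypothetical choices for this example, not the paper's exact settings:

```python
import numpy as np
from scipy.sparse import csr_matrix

def dense_and_sparse_split(w: np.ndarray, outlier_frac: float = 0.005):
    """Split a weight matrix into a narrow-range dense part (to be quantized)
    and a full-precision sparse part holding the largest-magnitude outliers."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_mask = np.abs(w) > thresh
    sparse_part = csr_matrix(np.where(outlier_mask, w, 0.0))   # outliers in CSR, full precision
    dense_part = np.where(outlier_mask, 0.0, w)                # remainder with a much narrower range
    return dense_part, sparse_part

w = np.random.randn(1024, 1024).astype(np.float32)
w[0, 0] = 50.0                                                 # an artificial outlier
dense, sparse = dense_and_sparse_split(w)
print("max |w| before:", np.abs(w).max(), "after:", np.abs(dense).max())

# At inference, y = W_dense_quantized @ x + W_sparse @ x, where the sparse matmul
# uses an efficient sparse kernel and runs in parallel with the dense part.
```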
The team demonstrates their framework's potential for quantizing instruction-following (IF) models by applying SqueezeLLM to the Vicuna-7B and 13B models. They use two evaluation setups. First, they use the MMLU dataset, a multi-task benchmark that measures a model's knowledge and problem-solving ability, to assess the quality of the generated output. They also use GPT-4 to rank the generation quality of the quantized models relative to the FP16 baseline, following the evaluation methodology introduced in Vicuna. In both benchmarks, SqueezeLLM consistently outperforms GPTQ and AWQ, two current state-of-the-art approaches. Notably, in both evaluations the 4-bit quantized model performs as well as the baseline.
The paper shows significant latency reductions and quantization performance gains with its models running on A6000 GPUs. The researchers demonstrate speedups of up to 2.3x over FP16 baseline inference for LLaMA-7B and 13B. Furthermore, the proposed method achieves up to 4x lower latency than GPTQ, demonstrating its effectiveness in both quantization performance and inference efficiency.
Check out the Paper and GitHub for more details.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the application of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.