Pretrained large language models (LLMs) exhibit remarkable language processing capabilities but require substantial computational resources. Binarization, which reduces model weights to a single bit, offers a way to dramatically cut compute and memory demands. However, existing quantization techniques struggle to maintain LLM performance at such low bitwidths, making it difficult to deploy LLMs efficiently while preserving their effectiveness across language processing tasks.
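To see why a single bit is so punishing, consider plain sign-and-scale binarization, the standard baseline that 1-bit methods build on. The sketch below is a minimal illustration of that baseline, not BiLLM's full scheme; the function names and tensor shapes are purely illustrative.

```python
import torch

def binarize(weights: torch.Tensor) -> torch.Tensor:
    """Plain 1-bit binarization: each weight is replaced by its sign,
    scaled by the mean absolute value (the L2-optimal scale for sign codes)."""
    alpha = weights.abs().mean()
    return alpha * torch.sign(weights)

w = torch.randn(4, 8)                     # a toy weight matrix
w_bin = binarize(w)
print((w - w_bin).pow(2).mean())          # reconstruction error left after 1-bit coding
```

The residual error printed here is exactly the information that naive binarization throws away, and it is what causes the performance collapse described next.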
Recent work has highlighted the exceptional performance of LLMs such as OPT and LLaMA on various benchmarks, but deploying them on memory-limited devices remains a challenge. Model quantization, particularly post-training quantization (PTQ), compresses LLMs effectively and reduces GPU memory consumption. While PTQ methods have succeeded at 8-bit and 4-bit quantization, the growing size of LLMs calls for more aggressive approaches such as neural network binarization. However, existing PTQ methods suffer performance collapse under such ultra-low-bit quantization.
Researchers from the University of Hong Kong, Beihang University, and ETH Zurich presented BiLLM, an innovative 1-bit post-training quantization scheme designed for pretrained LLMs. BiLLM uses weight distribution analysis to identify salient weights and employs a binary residual approximation strategy to minimize compression loss. It also introduces an optimal split search for accurately binarizing the non-salient weights, which follow a bell-shaped distribution.
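The residual approximation idea can be sketched in a few lines: binarize the salient weights once, then binarize what is left over, so that two binary terms share the representational burden. The code below is a simplified illustration of that idea under the assumption of per-group scales; it omits BiLLM's structured (column-wise) selection of salient weights, and all names are illustrative.

```python
import torch

def residual_binarize(w: torch.Tensor) -> torch.Tensor:
    """Approximate salient weights with two binary terms:
    a first sign/scale pass plus a binarized residual."""
    a1 = w.abs().mean()
    b1 = torch.sign(w)
    r = w - a1 * b1                    # residual error left by the first pass
    a2 = r.abs().mean()
    b2 = torch.sign(r)
    return a1 * b1 + a2 * b2           # higher-fidelity reconstruction for salient weights

w_salient = torch.randn(128)
print((w_salient - residual_binarize(w_salient)).pow(2).mean())
```

Because only the small fraction of salient weights receives this second binary term, the average bitwidth stays close to 1 bit while most of the compression loss is recovered where it matters.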
BiLLM presents a novel 1-bit post-training quantization method for LLMs that leverages Hessian-based weight sensitivity analysis. It employs a structured selection of salient weights and an optimal splitting of non-salient weights to minimize quantization error: salient weights receive a binary residual approximation, while the bell-shaped distribution of non-salient weights is split into regions that are binarized separately. This yields high-accuracy inference at ultra-low bitwidths together with an efficient GPU implementation.
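The sketch below illustrates the two ingredients named above under simplifying assumptions: a common Hessian-based saliency score of the form w² / diag(H⁻¹) (BiLLM's exact metric may differ) and a brute-force search over break-points that split the bell-shaped non-salient weights into a concentrated region and a sparse tail, each binarized with its own scale. All function names and the candidate grid are hypothetical.

```python
import torch

def binarize(x: torch.Tensor) -> torch.Tensor:
    """Sign/scale binarization of a group of weights."""
    if x.numel() == 0:
        return x
    return x.abs().mean() * torch.sign(x)

def saliency(w_row: torch.Tensor, h_inv_diag: torch.Tensor) -> torch.Tensor:
    """Hessian-based sensitivity score: larger means quantizing this weight
    perturbs the layer output (and hence the loss) more."""
    return w_row.pow(2) / h_inv_diag

def split_binarize(w_non_salient: torch.Tensor,
                   candidates=torch.linspace(0.1, 2.0, 40)):
    """Search a break-point p that splits bell-shaped non-salient weights into a
    concentrated region (|w| <= p) and a sparse tail (|w| > p), binarizing each
    region with its own scale and keeping the split with the lowest error."""
    best_err, best_recon = float("inf"), None
    for p in candidates:
        mask = w_non_salient.abs() <= p
        recon = torch.empty_like(w_non_salient)
        recon[mask] = binarize(w_non_salient[mask])
        recon[~mask] = binarize(w_non_salient[~mask])
        err = (w_non_salient - recon).pow(2).sum()
        if err < best_err:
            best_err, best_recon = err, recon
    return best_recon, best_err

w = torch.randn(4096)
recon, err = split_binarize(w)
print(err)   # error of the best two-region binarization found by the search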
BiLLM, implemented with the PyTorch and Huggingface libraries, presents an innovative 1-bit PTQ framework for LLMs. It outperforms existing methods such as GPTQ and PB-LLM, achieving superior perplexity results across model sizes and datasets, including WikiText2, PTB, and C4. BiLLM's structured salient-weight binarization and optimal splitting of non-salient weights significantly improve binary performance, demonstrating universal applicability and robustness across diverse LLM settings.
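For readers unfamiliar with how these perplexity numbers are obtained, the snippet below is a generic sketch of the standard WikiText2 evaluation with a Huggingface causal LM. The checkpoint name, window size, and chunking scheme are illustrative assumptions; this does not include BiLLM's quantization itself.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # illustrative; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate the WikiText2 test split into one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        # loss is the mean token negative log-likelihood over the window
        nlls.append(model(chunk, labels=chunk).loss * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```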
In conclusion, researchers from the University of Hong Kong, Beihang University, and ETH Zurich presented BiLLM, a novel post-training binary quantization method for compressing pretrained LLMs. By combining binary residual approximation for salient weights with optimal splitting for non-salient ones, BiLLM achieves ultra-low-bit quantization without significant loss of accuracy. It pushes the frontier of low-bitwidth LLM quantization, enabling deployment in edge scenarios and on resource-constrained devices while maintaining performance.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.