The introduction of pre-trained language models (PLMs) has transformed the field of natural language processing. They have demonstrated exceptional performance across a wide range of linguistic tasks, including natural language understanding (NLU) and natural language generation (NLG). However, these models typically contain millions or even billions of parameters, and the resulting computational and memory demands pose significant challenges, as the research community has widely recognized.
In this paper, the authors present a new quantization framework called LoRA-Fine-Tuning-aware Quantization (LoftQ). It is designed for pre-trained models that are to be quantized and then fine-tuned with LoRA. The framework combines low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights.
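To make the idea concrete, here is a minimal sketch of the alternating scheme the paper describes: repeatedly quantize the residual that the low-rank factors do not yet capture, then take a rank-r SVD of the remaining quantization error. The `uniform_quantize` helper, the rank, and the iteration count are illustrative assumptions, not the authors' implementation.

```python
import torch

def uniform_quantize(w, num_bits=4):
    # Simplified absmax uniform quantization, used here as a stand-in for
    # the paper's quantization function (not the authors' code).
    half_levels = 2 ** (num_bits - 1) - 1
    scale = w.abs().max()
    q = torch.clamp(torch.round(w / scale * half_levels), -half_levels, half_levels)
    return q / half_levels * scale

def loftq_init(W, rank=16, num_bits=4, num_iters=5):
    # Alternate between quantizing the residual and refitting the low-rank
    # factors, so that W ≈ Q + A @ B.T at the end.
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(W.shape[1], rank)
    Q = torch.zeros_like(W)
    for _ in range(num_iters):
        Q = uniform_quantize(W - A @ B.T, num_bits)   # quantize what the low-rank part misses
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # rank-r factors of the quantization error
        B = Vh[:rank, :].T
    return Q, A, B

# Example with a random "pre-trained" weight matrix
W = torch.randn(256, 128)
Q, A, B = loftq_init(W, rank=8, num_bits=2)
print((W - (Q + A @ B.T)).norm() / W.norm())  # relative approximation error
```

The point of the alternation is that the quantized backbone plus the LoRA initialization together approximate the original weights, rather than initializing the adapters at zero as plain QLoRA does.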
The figure above compares QLoRA performance at different bit widths. Left: QLoRA initialization of LLAMA-2-13b on WikiText-2. Right: QLoRA applied to LLAMA-2-13b on the WikiText-2 language modeling task. Lower perplexity indicates better performance.
Quantization methods. The authors apply two quantization methods to show that LoftQ supports different quantization functions (a simplified sketch of both follows the list):
• Uniform quantization is a classical quantization method. It uniformly divides a continuous interval into 2^N bins and stores a local maximum absolute value for dequantization.
• NF4 and its 2-bit variant NF2 are the quantization methods used in QLoRA. They assume that the high-precision values are drawn from a Gaussian distribution and map them to discrete levels of equal probability mass.
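Below is a simplified sketch of the two codebooks, assuming absmax scaling per weight block. The Normal-Float levels are derived from Gaussian quantiles via `torch.erfinv`, which mirrors the equal-probability idea but is not QLoRA's exact NF4/NF2 table.

```python
import torch

def uniform_codebook(num_bits):
    # Uniform quantization: 2^N evenly spaced levels in [-1, 1].
    return torch.linspace(-1.0, 1.0, 2 ** num_bits)

def normal_float_codebook(num_bits):
    # NF-style levels: Gaussian quantiles, so each level covers equal
    # probability mass (simplified; not QLoRA's exact codebook).
    n = 2 ** num_bits
    probs = torch.linspace(1.0 / (2 * n), 1.0 - 1.0 / (2 * n), n)
    levels = torch.erfinv(2 * probs - 1) * (2.0 ** 0.5)
    return levels / levels.abs().max()

def quantize_block(w, codebook):
    # Absmax-scale the block, then snap each value to the nearest level.
    scale = w.abs().max()
    idx = ((w / scale).unsqueeze(-1) - codebook).abs().argmin(dim=-1)
    return codebook[idx] * scale

w = torch.randn(64)
for name, cb in [("uniform", uniform_codebook(4)), ("nf4-style", normal_float_codebook(4))]:
    err = (w - quantize_block(w, cb)).abs().mean()
    print(f"{name}: mean abs error {err:.4f}")
```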
The authors perform 2-bit and 4-bit quantization on all models, achieving compression ratios of 25-30% and 15-20% at the 4-bit and 2-bit levels, respectively. All experiments are run on NVIDIA A100 GPUs.
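As a rough back-of-the-envelope check on those ratios, the snippet below sizes a single hypothetical 4096x4096 FP16 layer with rank-16 FP16 LoRA adapters; the exact reported ratios also depend on which parts of the model (embeddings, norms, and other unquantized parameters) remain in higher precision.

```python
# Rough storage estimate for one quantized layer plus FP16 LoRA adapters.
# Layer size and rank are illustrative assumptions, not values from the paper.
d_in, d_out, rank = 4096, 4096, 16

full_bits = d_in * d_out * 16                 # original FP16 weight matrix
adapter_bits = (d_in + d_out) * rank * 16     # LoRA factors kept in FP16
for bits in (4, 2):
    backbone_bits = d_in * d_out * bits
    ratio = (backbone_bits + adapter_bits) / full_bits
    print(f"{bits}-bit weights + rank-{rank} adapters ≈ {ratio:.1%} of the FP16 layer")
```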
The evaluation of the quantization framework is carried out through extensive experiments on various downstream tasks, including NLU, question answering, summarization, and NLG. The results demonstrate that LoftQ consistently outperforms QLoRA at all precision levels. For example, with 4-bit quantization, it achieves gains of 1.1 and 0.8 in ROUGE-1 on XSum and CNN/DailyMail, respectively. As the field of NLP continues to advance, further innovations and optimizations are expected to help close the gap between the immense potential of PLMs and their practical deployment, benefiting a wide range of applications and users.