Fine-tuning large pre-trained models is computationally challenging and often involves updating millions of parameters. This traditional approach, while effective, requires substantial time and computational resources, posing a bottleneck for tailoring these models to specific tasks. LoRA offers an effective solution to this problem by decomposing the update matrix learned during fine-tuning. To understand LoRA, let's start by reviewing traditional fine-tuning.
In traditional fine-tuning, we modify the weights of a pre-trained neural network to adapt it to a new task. This adjustment involves altering the original weight matrix (W) of the network. The changes made to (W) during fine-tuning are collectively represented by (ΔW), so the updated weights can be expressed as (W + ΔW).
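To make this concrete, here is a minimal sketch of what full fine-tuning does (in PyTorch, with a toy layer and dummy data chosen purely for illustration): every entry of (W) receives gradients, and the accumulated change is itself a full matrix (ΔW) with the same shape as (W).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 64, bias=False)        # weight matrix W, shape (64, 64)
W_before = layer.weight.detach().clone()

# One gradient step on dummy data: every entry of W is trainable.
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, target = torch.randn(8, 64), torch.randn(8, 64)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
optimizer.step()

# The accumulated change ΔW has the same d x d shape as W itself.
delta_W = layer.weight.detach() - W_before
print(delta_W.shape, "->", delta_W.numel(), "values updated")   # d² = 4096 here
```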
Now, instead of modifying (W) directly, the LoRA approach seeks to decompose (ΔW). This decomposition is a crucial step in reducing the computational overhead associated with fine-tuning large models.
The intrinsic rank hypothesis suggests that the significant changes to the neural network can be captured using a lower-dimensional representation. Essentially, it posits that not all elements of (ΔW) are equally important; instead, a much smaller set of directions of change can effectively encapsulate the necessary adjustments.
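One way to build intuition for this hypothesis is a small numerical illustration (not taken from the paper): if (ΔW) truly has low rank, a truncated SVD that keeps only a small number (r) of singular directions reconstructs it almost perfectly.

```python
import torch

torch.manual_seed(0)
d, r = 256, 4
# A synthetic update built to be low-rank, standing in for ΔW.
delta_W = torch.randn(d, r) @ torch.randn(r, d)

# Truncated SVD: keep only the top-r singular directions.
U, S, Vh = torch.linalg.svd(delta_W)
approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

rel_error = torch.linalg.norm(delta_W - approx) / torch.linalg.norm(delta_W)
print(f"relative error at rank {r}: {rel_error:.2e}")   # effectively zero, despite r << d
```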
Starting from this hypothesis, LoRA proposes to represent (ΔW) as the product of two smaller matrices, (A) and (B), of lower rank. The updated weight matrix (W') thus becomes:
(W' = W + BA)
In this equation, (W) remains frozen (i.e., it is not updated during training). The matrices (B) and (A) are of lower dimensionality, and their product (BA) represents a low-rank approximation of (ΔW).
By choosing matrices (A) and (B) to have a lower rank (r), the number of trainable parameters is significantly reduced. For example, if (W) is a (d × d) matrix, traditional fine-tuning would update all (d²) of its entries. With (B) and (A) of sizes (d × r) and (r × d) respectively, the total number of trainable parameters drops to (2dr), which is much smaller when (r ≪ d).
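The sketch below shows how this looks in code (an illustrative PyTorch module; the class name, sizes, and scaling details are our own choices, not prescribed by the paper): (W) is frozen, only (A) and (B) receive gradients, and the update is applied as two small matrix multiplications so the full (d × d) matrix (BA) is never materialized. Following the paper's initialization, (A) starts random and (B) starts at zero, so (BA = 0) before training.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: y = Wx + BAx, with W frozen."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)
        self.W.weight.requires_grad_(False)               # W stays frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)   # (r x d), trainable
        self.B = nn.Parameter(torch.zeros(d, r))          # (d x r), trainable; BA = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W'x = Wx + BAx, applied as two small matmuls so the
        # full d x d matrix BA is never formed explicitly.
        return self.W(x) + (x @ self.A.T) @ self.B.T

d, r = 1024, 8
layer = LoRALinear(d, r)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable (2dr) vs", d * d, "for full fine-tuning")   # 16384 vs 1048576
```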
The reduction in the number of trainable parameters achieved through Low-Rank Adaptation (LoRA) offers several significant benefits, particularly when fine-tuning large-scale neural networks:
- Reduced memory footprint: LoRA decreases memory needs by reducing the number of parameters to update, which helps in managing large-scale models.
- Faster training and adaptation: By simplifying computational demands, LoRA accelerates the training and tuning of large models for new tasks.
- Feasibility for smaller hardware: LoRA's smaller number of trainable parameters makes it possible to fine-tune large models on less powerful hardware, such as modest GPUs or CPUs.
- Scaling to larger models: LoRA makes it easier to scale up AI models without a corresponding increase in computational resources, making it more practical to manage models of ever-increasing size.
In the context of LoRA, the rank (r) plays a fundamental role in determining the efficiency and effectiveness of the adaptation process. Remarkably, the paper highlights that the rank used for the matrices (A) and (B) can be surprisingly low, sometimes as low as one, while still adapting the model effectively.
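A quick back-of-the-envelope calculation (with (d = 4096) chosen only as an example layer size, not a figure from the paper) shows how the trainable parameter count (2dr) scales with the choice of (r):

```python
# Trainable parameters 2*d*r for a single d x d weight, at a few ranks.
d = 4096
for r in (1, 2, 8, 64):
    print(f"r={r:>3}: {2 * d * r:>9,} trainable vs {d * d:,} for full fine-tuning")
# r=  1:     8,192 trainable vs 16,777,216 for full fine-tuning
```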
Although the LoRA paper predominantly shows experiments within the realm of natural language processing (NLP), the underlying low-rank adaptation approach has broad applicability and could be effectively employed in training various types of neural networks in different domains.
LoRA's approach of decomposing (ΔW) into a product of lower-rank matrices effectively balances the need to adapt large pre-trained models to new tasks with the need to maintain computational efficiency. The concept of intrinsic rank is key to this balance: it ensures that the model's capacity to learn the new task is preserved with far fewer trainable parameters.