Building upon the hypothesis that the weight update during adaptation possesses a low intrinsic rank, LoRA introduces a specific mathematical structure to exploit this property for parameter efficiency. Recall that standard fine-tuning updates a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ by adding a delta matrix $\Delta W \in \mathbb{R}^{d \times k}$, resulting in the adapted weights $W = W_0 + \Delta W$. Training involves learning all $d \times k$ parameters in $\Delta W$.
LoRA proposes a different approach. Instead of learning the potentially large $\Delta W$ directly, we approximate it using a low-rank decomposition. Specifically, $\Delta W$ is represented by the product of two smaller matrices, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$:
$$\Delta W \approx BA$$

Here, $r$ is the rank of the decomposition, and the core idea of LoRA is that $r \ll \min(d, k)$. This constraint significantly reduces the number of parameters we need to learn. While the original weights $W_0$ are kept frozen (not updated during training), the matrices $A$ and $B$ contain the trainable parameters representing the task-specific adaptation.
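To make the decomposition concrete, the short NumPy sketch below (using arbitrary example dimensions) builds an update from two small factors and confirms that its rank is bounded by $r$:

```python
import numpy as np

d, k, r = 512, 256, 8          # example dimensions; r is much smaller than d and k

B = np.random.randn(d, r)      # d x r factor
A = np.random.randn(r, k)      # r x k factor
delta_W = B @ A                # d x k update, but with rank at most r

print(delta_W.shape)                   # (512, 256)
print(np.linalg.matrix_rank(delta_W))  # 8, i.e. at most r
```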
Consider the forward pass through a layer modified by LoRA. For an input $x$, the original output is $h = W_0 x$. The modified output, incorporating the LoRA update, becomes:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$

During fine-tuning with LoRA, only the parameters within matrices $A$ and $B$ are updated via gradient descent. The original weights $W_0$ remain unchanged.
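A minimal sketch of this modified forward pass, again in plain NumPy with hypothetical dimensions, is shown below. Note that computing $B(Ax)$ avoids ever materializing the full $d \times k$ matrix $\Delta W$:

```python
import numpy as np

d, k, r = 512, 256, 8

W0 = np.random.randn(d, k)     # frozen pre-trained weights
A = np.random.randn(r, k)      # trainable
B = np.random.randn(d, r)      # trainable
x = np.random.randn(k)         # input vector

h_original = W0 @ x            # original path: W0 x
h_lora = W0 @ x + B @ (A @ x)  # LoRA-modified path: W0 x + B A x
```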
To further control the adaptation process, LoRA introduces a constant scaling factor $\alpha$. This scalar modulates the magnitude of the update applied by $BA$. It's common practice to scale the update by $\frac{\alpha}{r}$. This normalization helps stabilize training, especially when changing the rank $r$. The final forward pass equation for a LoRA-modified layer is:
$$h = W_0 x + \frac{\alpha}{r} BAx$$

Let's analyze the parameter efficiency gain. Full fine-tuning requires learning $d \times k$ parameters for the $\Delta W$ matrix. With LoRA, we only need to learn the parameters in $A$ and $B$. The total number of trainable parameters in LoRA is the sum of parameters in $A$ ($r \times k$) and $B$ ($d \times r$), which equals $r(d + k)$. Since $r$ is typically much smaller than $d$ and $k$, the reduction in trainable parameters is substantial. For example, if $d = 4096$, $k = 4096$, and $r = 8$, full fine-tuning requires approximately 16.7 million parameters, whereas LoRA requires only $8 \times (4096 + 4096) = 65{,}536$ parameters for that specific layer. This represents a parameter reduction of over 99%.
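The arithmetic from this example can be verified directly. The small helper below is a hypothetical utility (not part of any library) that counts trainable parameters for a single $d \times k$ weight matrix under both approaches:

```python
def trainable_params(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    full = d * k          # every entry of delta W
    lora = r * (d + k)    # entries of A (r x k) plus B (d x r)
    return full, lora

full, lora = trainable_params(d=4096, k=4096, r=8)
print(full)   # 16777216  (~16.7 million)
print(lora)   # 65536
print(f"{100 * (1 - lora / full):.2f}% reduction")  # 99.61% reduction
```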
The structure can be visualized as a parallel path added to the original weight matrix:
Forward pass through a LoRA-modified layer. Input $x$ passes through the frozen weights $W_0$ and, in parallel, through the trainable low-rank matrices $A$ and $B$. The low-rank path is scaled by $\alpha/r$ before being added to the original output.
Regarding initialization, a common strategy is to initialize $A$ with random Gaussian values and $B$ with zeros. This ensures that $\Delta W = BA$ is zero at the beginning of training, meaning the adapted model starts exactly as the pre-trained model $W_0$. The scaling factor $\alpha$ is typically set to match the initial value of $r$, although it can be treated as a hyperparameter. We will examine initialization and hyperparameter choices like $r$ and $\alpha$ in more detail in subsequent sections (Rank Selection Strategies, Scaling Parameter Alpha, LoRA Initialization Strategies).
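Putting these pieces together, a compact PyTorch sketch of a LoRA-wrapped linear layer might look like the following. This `LoRALinear` class is purely illustrative (it is not the reference implementation from the LoRA paper or the `peft` library), but it captures the frozen base weight, the Gaussian-initialized $A$, the zero-initialized $B$, and the $\alpha/r$ scaling discussed above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 8, alpha: float = 8.0):
        super().__init__()
        # Frozen pre-trained weight W0 (d x k); random here as a stand-in.
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Trainable low-rank factors: A is small Gaussian, B is zeros,
        # so B @ A = 0 and the layer starts identical to the pre-trained one.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (..., k); output has shape (..., d)
        base = x @ self.W0.T
        update = x @ self.A.T @ self.B.T   # low-rank path, scaled by alpha / r
        return base + self.scaling * update

# Example usage with the dimensions from the text above.
layer = LoRALinear(d=4096, k=4096, r=8, alpha=8.0)
x = torch.randn(2, 4096)
print(layer(x).shape)  # torch.Size([2, 4096])
```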
In summary, the mathematical formulation of LoRA provides a concrete mechanism for approximating the weight update $\Delta W$ with a low-rank structure $BA$. By freezing the original weights $W_0$ and only training the small matrices $A$ and $B$, LoRA achieves significant parameter efficiency, drastically reducing the computational and memory requirements for fine-tuning large models while aiming to preserve adaptation capability.