Low-Rank Adaptation (LoRA) introduces a specific mathematical structure designed around the hypothesis that weight updates during adaptation possess a low intrinsic rank, aiming for parameter efficiency. Standard fine-tuning updates a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ by adding a delta matrix $\Delta W$, resulting in the adapted weights $W = W_0 + \Delta W$. This traditional training method requires learning all $d \times k$ parameters in $\Delta W$.
LoRA proposes a different approach. Instead of learning the potentially large $\Delta W$ directly, we approximate it using a low-rank decomposition. Specifically, $\Delta W$ is represented by the product of two smaller matrices, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$:

$$\Delta W = BA$$
Here, $r$ is the rank of the decomposition, and the core idea of LoRA is that $r \ll \min(d, k)$. This constraint significantly reduces the number of parameters we need to learn. While the original weights $W_0$ are kept frozen (not updated during training), the matrices $B$ and $A$ contain the trainable parameters representing the task-specific adaptation.
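As a quick illustration of the shapes involved, here is a minimal NumPy sketch; the dimensions $d$, $k$, and $r$ are arbitrary example values, not taken from any particular model:

```python
import numpy as np

# Illustrative dimensions: a d x k weight matrix and a rank-r decomposition
d, k, r = 1024, 1024, 8

B = np.random.randn(d, r)   # d x r factor
A = np.random.randn(r, k)   # r x k factor

delta_W = B @ A             # the product recovers the full d x k shape
assert delta_W.shape == (d, k)
assert np.linalg.matrix_rank(delta_W) <= r  # rank of BA is bounded by r
```

The product $BA$ always has the full $d \times k$ shape, but its rank can never exceed $r$, which is exactly the structural constraint LoRA imposes.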
Consider the forward pass through a layer modified by LoRA. For an input $x$, the original output is $h = W_0 x$. The modified output, incorporating the LoRA update, becomes:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$
During fine-tuning with LoRA, only the parameters within matrices $B$ and $A$ are updated via gradient descent. The original weights $W_0$ remain unchanged.
To further control the adaptation process, LoRA introduces a constant scaling factor $\alpha$. This scalar modulates the magnitude of the update applied by $BA$. It's common practice to scale the update by $\frac{\alpha}{r}$. This normalization helps stabilize training, especially when changing the rank $r$. The final forward pass equation for a LoRA-modified layer is:

$$h = W_0 x + \frac{\alpha}{r} BAx$$
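To make the mechanics concrete, here is a minimal sketch of such a layer, assuming PyTorch. The class name `LoRALinear`, the default hyperparameters, and the Gaussian scale are illustrative choices, not a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights W0 (and bias): no gradients flow to them
        for p in self.base.parameters():
            p.requires_grad = False
        # Trainable low-rank factors: A (r x k) Gaussian, B (d x r) zeros
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r  # the alpha / r normalization from the text

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen original path plus the scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

With the defaults above, $\alpha = r = 8$, so the scaling factor starts at $1$; because $B$ is initialized to zero, the wrapped layer initially behaves identically to the frozen base layer.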
Let's analyze the parameter efficiency gain. Full fine-tuning requires learning all $d \times k$ parameters of the matrix $\Delta W$. With LoRA, we only need to learn the parameters in $B$ and $A$. The total number of trainable parameters in LoRA is the sum of parameters in $B$ ($d \times r$) and $A$ ($r \times k$), which equals $r \times (d + k)$. Since $r$ is typically much smaller than $d$ and $k$, the reduction in trainable parameters is substantial. For example, if $d = 4096$, $k = 4096$, and $r = 8$, full fine-tuning requires approximately $16.8$ million parameters, whereas LoRA requires only $65{,}536$ parameters for that specific layer. This represents a parameter reduction of over 99%.
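The arithmetic can be checked directly, using the example dimensions above:

```python
d, k, r = 4096, 4096, 8

full = d * k            # parameters in a full delta W
lora = r * (d + k)      # parameters in B (d*r) plus A (r*k)

print(f"full fine-tuning: {full:,}")              # 16,777,216
print(f"LoRA:             {lora:,}")              # 65,536
print(f"reduction:        {1 - lora / full:.2%}") # 99.61%
```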
The structure can be visualized as a parallel path added to the original weight matrix:
Forward pass through a LoRA-modified layer. Input $x$ passes through the frozen weights $W_0$ and, in parallel, through the trainable low-rank matrices $A$ and $B$. The low-rank path is scaled by $\frac{\alpha}{r}$ before being added to the original output.
Regarding initialization, a common strategy is to initialize $A$ using random Gaussian values and $B$ with zeros. This ensures that $\Delta W = BA$ is zero at the beginning of training, meaning the adapted model starts exactly as the pre-trained model $W_0$. The scaling factor $\alpha$ is typically set to match the initial value of $r$, although it can be treated as a hyperparameter. We will examine initialization and hyperparameter choices like $r$ and $\alpha$ in more detail in subsequent sections (Rank Selection Strategies, Scaling Parameter Alpha, LoRA Initialization Strategies).
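A minimal sketch of this scheme, assuming PyTorch; the Gaussian standard deviation shown is an illustrative choice:

```python
import torch
import torch.nn as nn

d, k, r = 4096, 4096, 8

A = nn.Parameter(torch.empty(r, k))
B = nn.Parameter(torch.empty(d, r))

nn.init.normal_(A, mean=0.0, std=0.02)  # A: random Gaussian values
nn.init.zeros_(B)                       # B: zeros

# Because B = 0, the initial update BA is exactly zero,
# so the adapted layer starts out identical to the frozen one.
assert torch.all(B @ A == 0)
```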
In summary, the mathematical formulation of LoRA provides a concrete mechanism for approximating the weight update $\Delta W$ with a low-rank structure $BA$. By freezing the original weights $W_0$ and only training the small matrices $B$ and $A$, LoRA achieves significant parameter efficiency, drastically reducing the computational and memory requirements for fine-tuning large models while aiming to preserve adaptation capability.