The LoRA hypothesis proposes that the change required to adapt a pre-trained weight matrix W for a new task can be represented effectively using a low-rank structure. Instead of directly learning the potentially large update matrix ΔW, LoRA decomposes this update into two smaller matrices, B and A.
Let's consider an original weight matrix W ∈ R^{d×k} from a layer in our LLM (e.g., a linear layer in the self-attention or feed-forward network). The full fine-tuning approach would involve updating all d×k parameters in W. LoRA avoids this by keeping W frozen and introducing two new matrices: B ∈ R^{d×r} and A ∈ R^{r×k}.
Here, r is the rank of the decomposition, and it's a hyperparameter chosen such that r≪min(d,k). The original weight update ΔW is then approximated by the product of these two matrices:
ΔW≈BA
Notice the dimensions: multiplying B (d×r) and A (r×k) results in a matrix of the original dimensions d×k, suitable as an update for W.
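As a quick dimension check, here is a small PyTorch sketch; the sizes d = k = 4096 and the rank r = 8 are illustrative choices, not values prescribed by LoRA:

```python
import torch

d, k, r = 4096, 4096, 8      # illustrative layer sizes and low rank

B = torch.zeros(d, r)        # d x r
A = torch.randn(r, k)        # r x k

delta_W = B @ A              # d x k, the same shape as the frozen W
print(delta_W.shape)         # torch.Size([4096, 4096])
```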
During the fine-tuning process, the forward pass of the layer is modified. For an input x ∈ R^k, the original computation is h = Wx. With LoRA, the output h ∈ R^d becomes:
h=Wx+ΔWx=Wx+BAx
Critically, the pre-trained weights W are not updated during training. Only the parameters of the much smaller matrices A and B are optimized. This dramatically reduces the number of trainable parameters.
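To make this parameter split concrete, the snippet below sketches the modified forward pass at the tensor level; the sizes and the 0.01 initialization scale are illustrative assumptions. Note that BAx can be computed as B(Ax) without ever materializing the full d×k update matrix:

```python
import torch

d, k, r = 4096, 4096, 8

W = torch.randn(d, k)             # pre-trained weight, kept frozen
A = torch.randn(r, k) * 0.01      # trainable low-rank factor
B = torch.zeros(d, r)             # trainable low-rank factor
A.requires_grad_(True)
B.requires_grad_(True)

x = torch.randn(k)                # a single input vector
h = W @ x + B @ (A @ x)           # h = Wx + BAx, with BAx computed as B(Ax)
print(h.shape)                    # torch.Size([4096])
```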
The LoRA technique introduces a scaling factor α. The modified forward pass incorporating this scaling is often written as:
h = Wx + (α/r)BAx
Here, α is another hyperparameter that scales the contribution of the LoRA update. Dividing by the rank r helps to normalize the magnitude of the update, making the effect of α less dependent on the choice of r. Essentially, α controls the extent to which the adapted model deviates from the original model.
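Putting these pieces together, a module-level sketch of a LoRA-adapted linear layer might look as follows. The class name LoRALinear and the defaults r = 8 and α = 16 are illustrative assumptions, not fixed parts of the method:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a linear layer with a frozen base weight and a scaled LoRA update."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze W (and any bias)
        d, k = base_linear.out_features, base_linear.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # trainable, r x k
        self.B = nn.Parameter(torch.zeros(d, r))          # trainable, d x r
        self.scaling = alpha / r                          # e.g. 16 / 8 = 2.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)

# Wrap an existing projection; only A and B receive gradients during training.
layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16.0)
out = layer(torch.randn(2, 4096))   # shape: (2, 4096)
```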
The efficiency gain comes from the significant reduction in trainable parameters. Instead of training d×k parameters for ΔW, we only train the parameters in A and B.
Total trainable parameters in LoRA = Parameters(A) + Parameters(B) = (r×k)+(d×r)=r(d+k).
Consider a typical large linear layer where d = k = 4096, adapted with a rank of r = 8. Full fine-tuning would update d×k = 4096×4096 = 16,777,216 parameters for this layer, whereas LoRA trains only r(d+k) = 8×(4096+4096) = 65,536.
In this example, LoRA requires training only 65,536/16,777,216≈0.39% of the parameters compared to full fine-tuning for this single layer. This reduction is substantial, especially when applied across multiple layers in a large model, leading to significant savings in memory (for optimizer states, gradients) and potentially faster training iterations.
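These counts are easy to verify for the illustrative d = k = 4096, r = 8 case:

```python
d = k = 4096
r = 8

full_params = d * k           # 4096 * 4096 = 16,777,216 parameters in W
lora_params = r * (d + k)     # 8 * (4096 + 4096) = 65,536 parameters in A and B

print(f"{lora_params / full_params:.4%}")   # 0.3906%
```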
Proper initialization of the LoRA matrices is important for stable training. A common strategy is to initialize A with small random values (for example, drawn from a Gaussian distribution) and to initialize B with zeros.
This ensures that at the beginning of training (t=0), the product BA is zero. Consequently, the initial LoRA-adapted model behaves exactly like the original pre-trained model (ΔW=0), providing a stable starting point for the adaptation process.
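A minimal sketch of this initialization in PyTorch is shown below. The Kaiming-uniform initializer for A is one common choice (an assumption here, not the only option); the essential part for the argument above is that B starts at zero:

```python
import math
import torch
import torch.nn as nn

d, k, r = 4096, 4096, 8                      # illustrative sizes

A = nn.Parameter(torch.empty(r, k))
B = nn.Parameter(torch.empty(d, r))

nn.init.kaiming_uniform_(A, a=math.sqrt(5))  # small random values for A
nn.init.zeros_(B)                            # zeros for B, so BA = 0 at t = 0

assert torch.all(B @ A == 0)                 # the adapted layer starts out identical to the original
```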
We can visualize how the LoRA matrices modify the standard forward pass of a linear layer.
The data flow in a LoRA-adapted layer. The input x passes through the original frozen weight matrix W. In parallel, x is also processed by the low-rank matrices A and B, followed by scaling. The outputs of both paths are then added together to produce the final output h.
This decomposition allows LoRA to adapt large models efficiently by focusing the learning process on a small, carefully structured set of parameters that approximate the necessary changes to the original weights. The next sections explore practical aspects like selecting the rank r and the scaling factor α.