As introduced in the LoRA hypothesis, the core idea is that the change required to adapt a pre-trained weight matrix W for a new task can be represented effectively using a low-rank structure. Instead of directly learning the potentially large update matrix ΔW, LoRA decomposes this update into two smaller matrices, B and A.
Let's consider an original weight matrix W∈R^{d×k} from a layer in our LLM (e.g., a linear layer in the self-attention or feed-forward network). The full fine-tuning approach would involve updating all d×k parameters in W. LoRA avoids this by keeping W frozen and introducing two new matrices: a down-projection matrix A∈R^{r×k} and an up-projection matrix B∈R^{d×r}.
Here, r is the rank of the decomposition, and it's a hyperparameter chosen such that r≪min(d,k). The original weight update ΔW is then approximated by the product of these two matrices:
ΔW≈BA
Notice the dimensions: multiplying B (d×r) and A (r×k) results in a matrix of the original dimensions d×k, suitable as an update for W.
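As a quick check of these dimensions, the sketch below uses PyTorch with illustrative values of d, k, and r (these values are assumptions, not taken from any particular model):

```python
import torch

d, k, r = 4096, 4096, 8            # illustrative dimensions and rank

B = torch.zeros(d, r)              # d x r
A = torch.randn(r, k) * 0.01       # r x k
delta_W = B @ A                    # (d x r) @ (r x k) -> d x k

print(delta_W.shape)               # torch.Size([4096, 4096]), same shape as W
```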
During the fine-tuning process, the forward pass of the layer is modified. For an input x∈R^k, the original computation is h=Wx. With LoRA, the output h∈R^d becomes:
h=Wx+ΔWx=Wx+BAx
Critically, the pre-trained weights W are not updated during training. Only the parameters of the much smaller matrices A and B are optimized. This dramatically reduces the number of trainable parameters.
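The snippet below sketches this forward pass with plain PyTorch tensors (dimensions are illustrative). Note that ΔW is never materialized: computing B(Ax) as two small products is cheaper than forming BA and then multiplying by x:

```python
import torch

d, k, r = 1024, 1024, 8

W = torch.randn(d, k)                          # pre-trained weight, kept frozen
W.requires_grad_(False)

A = torch.randn(r, k, requires_grad=True)      # trainable LoRA matrix
B = torch.zeros(d, r, requires_grad=True)      # trainable LoRA matrix

x = torch.randn(k)
h = W @ x + B @ (A @ x)                        # h = Wx + BAx
```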
The LoRA technique introduces a scaling factor α. The modified forward pass incorporating this scaling is often written as:
h=Wx+(α/r)BAx
Here, α is another hyperparameter that scales the contribution of the LoRA update. Dividing by the rank r helps to normalize the magnitude of the update, making the effect of α less dependent on the choice of r. Essentially, α controls the extent to which the adapted model deviates from the original model.
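Putting the pieces together, one way to wrap an existing linear layer is sketched below. This is a simplified illustration, not a reference implementation: the class name LoRALinear, the initialization of A, and the default values of r and alpha are assumptions chosen for demonstration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical wrapper adding a scaled low-rank update to a frozen linear layer."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained weights
            p.requires_grad_(False)

        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # r x k, small random values
        self.B = nn.Parameter(torch.zeros(d, r))          # d x r, zeros so BA = 0 at start
        self.scaling = alpha / r                          # the alpha / r factor

    def forward(self, x):
        # h = Wx + (alpha / r) * B A x
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

When constructing the optimizer, only the parameters with requires_grad=True (here A and B) need to be passed in; the frozen base weights receive no gradients.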
The efficiency gain comes from the significant reduction in trainable parameters. Instead of training d×k parameters for ΔW, we only train the parameters in A and B.
Total trainable parameters in LoRA = Parameters(A) + Parameters(B) = (r×k)+(d×r)=r(d+k).
Consider a typical large linear layer where d=k=4096. Full fine-tuning would update d×k=4096×4096=16,777,216 parameters. With LoRA at rank r=8, only r(d+k)=8×(4096+4096)=65,536 parameters are trained.
In this example, LoRA requires training only 65,536/16,777,216≈0.39% of the parameters compared to full fine-tuning for this single layer. This reduction is substantial, especially when applied across multiple layers in a large model, leading to significant savings in memory (for optimizer states, gradients) and potentially faster training iterations.
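This arithmetic is simple to verify (r=8 as assumed in the example above):

```python
d, k, r = 4096, 4096, 8

full_ft = d * k          # parameters updated by full fine-tuning: 16,777,216
lora = r * (d + k)       # trainable LoRA parameters: 65,536

print(f"{lora:,} / {full_ft:,} = {lora / full_ft:.2%}")   # 65,536 / 16,777,216 = 0.39%
```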
Proper initialization of the LoRA matrices is important for stable training. A common strategy is to initialize A with small random values (for example, drawn from a zero-mean Gaussian) and to initialize B with zeros.
This ensures that at the beginning of training (t=0), the product BA is zero. Consequently, the initial LoRA-adapted model behaves exactly like the original pre-trained model (ΔW=0), providing a stable starting point for the adaptation process.
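A few lines are enough to confirm this behavior (dimensions illustrative):

```python
import torch

d, k, r = 1024, 1024, 8

A = torch.randn(r, k) * 0.01   # small random Gaussian values
B = torch.zeros(d, r)          # all zeros

# At the start of training, BA = 0, so the adapted layer reproduces
# the original pre-trained layer's output exactly.
assert torch.all(B @ A == 0)
```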
We can visualize how the LoRA matrices modify the standard forward pass of a linear layer.
The data flow in a LoRA-adapted layer. The input x passes through the original frozen weight matrix W. In parallel, x is also processed by the low-rank matrices A and B, followed by scaling. The outputs of both paths are then added together to produce the final output h.
This decomposition allows LoRA to adapt large models efficiently by focusing the learning process on a small, carefully structured set of parameters that approximate the necessary changes to the original weights. The next sections explore practical aspects like selecting the rank r and the scaling factor α.