As introduced in the LoRA hypothesis, the core idea is that the change required to adapt a pre-trained weight matrix W for a new task can be represented effectively using a low-rank structure. Instead of directly learning the potentially large update matrix ΔW, LoRA decomposes this update into two smaller matrices, B and A.
Let's consider an original weight matrix W∈R^{d×k} from a layer in our LLM (e.g., a linear layer in the self-attention or feed-forward network). The full fine-tuning approach would involve updating all d×k parameters in W. LoRA avoids this by keeping W frozen and introducing two new matrices: a down-projection matrix A∈R^{r×k} and an up-projection matrix B∈R^{d×r}.
Here, r is the rank of the decomposition, and it's a hyperparameter chosen such that r≪min(d,k). The original weight update ΔW is then approximated by the product of these two matrices:
ΔW≈BA
Notice the dimensions: multiplying B (d×r) and A (r×k) results in a matrix of the original dimensions d×k, suitable as an update for W.
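As a quick check of these dimensions, the sketch below uses PyTorch with illustrative values of d, k, and r (these values are assumptions, not taken from any particular model):

```python
import torch

d, k, r = 4096, 4096, 8            # illustrative dimensions and rank

B = torch.zeros(d, r)              # d x r
A = torch.randn(r, k) * 0.01       # r x k
delta_W = B @ A                    # (d x r) @ (r x k) -> d x k

print(delta_W.shape)               # torch.Size([4096, 4096]), same shape as W
```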
During the fine-tuning process, the forward pass of the layer is modified. For an input x∈R^k, the original computation is h=Wx. With LoRA, the output h∈R^d becomes:
h=Wx+ΔWx=Wx+BAx
Critically, the pre-trained weights W are not updated during training. Only the parameters of the much smaller matrices A and B are optimized. This dramatically reduces the number of trainable parameters.
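The snippet below sketches this forward pass with plain PyTorch tensors (dimensions are illustrative). Note that ΔW is never materialized: computing B(Ax) as two small products is cheaper than forming BA and then multiplying by x:

```python
import torch

d, k, r = 1024, 1024, 8

W = torch.randn(d, k)                          # pre-trained weight, kept frozen
W.requires_grad_(False)

A = torch.randn(r, k, requires_grad=True)      # trainable LoRA matrix
B = torch.zeros(d, r, requires_grad=True)      # trainable LoRA matrix

x = torch.randn(k)
h = W @ x + B @ (A @ x)                        # h = Wx + BAx
```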
The LoRA technique introduces a scaling factor α. The modified forward pass incorporating this scaling is often written as:
h=Wx+(α/r)BAx
Here, α is another hyperparameter that scales the contribution of the LoRA update. Dividing by the rank r helps to normalize the magnitude of the update, making the effect of α less dependent on the choice of r. Essentially, α controls the extent to which the adapted model deviates from the original model.
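Putting the pieces together, one way to wrap an existing linear layer is sketched below. This is a simplified illustration, not a reference implementation: the class name LoRALinear, the initialization of A, and the default values of r and alpha are assumptions chosen for demonstration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical wrapper adding a scaled low-rank update to a frozen linear layer."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained weights
            p.requires_grad_(False)

        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # r x k, small random values
        self.B = nn.Parameter(torch.zeros(d, r))          # d x r, zeros so BA = 0 at start
        self.scaling = alpha / r                          # the alpha / r factor

    def forward(self, x):
        # h = Wx + (alpha / r) * B A x
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

When constructing the optimizer, only the parameters with requires_grad=True (here A and B) need to be passed in; the frozen base weights receive no gradients.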
The efficiency gain comes from the significant reduction in trainable parameters. Instead of training d×k parameters for ΔW, we only train the parameters in A and B.
Total trainable parameters in LoRA = Parameters(A) + Parameters(B) = (r×k)+(d×r)=r(d+k).
Consider a typical large linear layer where d=k=4096. Full fine-tuning would update d×k=4096×4096=16,777,216 parameters. With LoRA at rank r=8, only r(d+k)=8×(4096+4096)=65,536 parameters are trained.
In this example, LoRA requires training only 65,536/16,777,216≈0.39% of the parameters compared to full fine-tuning for this single layer. This reduction is substantial, especially when applied across multiple layers in a large model, leading to significant savings in memory (for optimizer states, gradients) and potentially faster training iterations.
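This arithmetic is simple to verify (r=8 as assumed in the example above):

```python
d, k, r = 4096, 4096, 8

full_ft = d * k          # parameters updated by full fine-tuning: 16,777,216
lora = r * (d + k)       # trainable LoRA parameters: 65,536

print(f"{lora:,} / {full_ft:,} = {lora / full_ft:.2%}")   # 65,536 / 16,777,216 = 0.39%
```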
Proper initialization of the LoRA matrices is important for stable training. A common strategy is to initialize A with small random values (for example, drawn from a zero-mean Gaussian) and to initialize B with zeros.
This ensures that at the beginning of training (t=0), the product BA is zero. Consequently, the initial LoRA-adapted model behaves exactly like the original pre-trained model (ΔW=0), providing a stable starting point for the adaptation process.
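A few lines are enough to confirm this behavior (dimensions illustrative):

```python
import torch

d, k, r = 1024, 1024, 8

A = torch.randn(r, k) * 0.01   # small random Gaussian values
B = torch.zeros(d, r)          # all zeros

# At the start of training, BA = 0, so the adapted layer reproduces
# the original pre-trained layer's output exactly.
assert torch.all(B @ A == 0)
```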
We can visualize how the LoRA matrices modify the standard forward pass of a linear layer.
The data flow in a LoRA-adapted layer. The input x passes through the original frozen weight matrix W. In parallel, x is also processed by the low-rank matrices A and B, followed by scaling. The outputs of both paths are then added together to produce the final output h.
This decomposition allows LoRA to adapt large models efficiently by focusing the learning process on a small, carefully structured set of parameters that approximate the necessary changes to the original weights. The next sections explore practical aspects like selecting the rank r and the scaling factor α.