Having established the theoretical underpinnings of LoRA in the previous sections, specifically the hypothesis that adaptation occurs in a low-dimensional subspace and the mathematical formulation $\Delta W \approx BA$, we now turn to the practical matter of implementing this technique within neural network layers. The core idea is not to change the original weights $W_0$ directly, but rather to augment the layer's computation with the low-rank update computed via matrices $A$ and $B$.
The standard forward pass for a linear (fully connected) layer is typically represented as:

$$y = x W_0^T + b$$

where $x$ is the input tensor, $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix, $b$ is the optional bias vector, and $y$ is the output tensor. The dimensions assume an input batch $x \in \mathbb{R}^{N \times d_{in}}$ yielding an output $y \in \mathbb{R}^{N \times d_{out}}$.
With LoRA, we keep $W_0$ and $b$ frozen. The adaptation $\Delta W = BA$ (where $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and $r$ is the rank) is added to the computation. The modified forward pass becomes:

$$y = x W_0^T + x (\Delta W)^T + b$$

Substituting $\Delta W = BA$:

$$y = x W_0^T + x (BA)^T + b = x W_0^T + x A^T B^T + b$$
The LoRA paper introduces a scaling factor $\frac{\alpha}{r}$ applied to the low-rank update. This scaling helps manage the magnitude of the adaptation relative to the original weights, especially when changing the rank $r$. The final LoRA forward pass is:

$$y = x W_0^T + \frac{\alpha}{r}\left(x A^T B^T\right) + b$$
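As a quick sanity check of this formula, the short sketch below (with small, arbitrary dimensions chosen only for illustration) confirms that adding the scaled low-rank term to the frozen layer's output is equivalent to computing a single linear layer with the merged weight $W_0 + \frac{\alpha}{r} BA$:

import torch

torch.manual_seed(0)
N, d_in, d_out, r, alpha = 4, 16, 8, 2, 4.0
x = torch.randn(N, d_in)
W0 = torch.randn(d_out, d_in)   # frozen weight
b = torch.randn(d_out)          # frozen bias
A = torch.randn(r, d_in)        # LoRA A (r x d_in)
B = torch.randn(d_out, r)       # LoRA B (d_out x r)
scaling = alpha / r

# Path 1: frozen output plus the scaled low-rank term
y_lora = x @ W0.T + scaling * (x @ A.T @ B.T) + b
# Path 2: single linear layer using the merged weight W0 + (alpha/r) * B @ A
y_merged = x @ (W0 + scaling * (B @ A)).T + b

print(torch.allclose(y_lora, y_merged, atol=1e-6))  # True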
This formulation is significant: the frozen weights $W_0$ and $b$ are never modified, only the much smaller matrices $A$ and $B$ receive gradients during training, and because the update is purely additive, $BA$ can later be merged into $W_0$ so that inference incurs no extra latency.
Let's sketch out how to implement a LoRALinear layer in PyTorch. This typically involves creating a custom module that wraps or replaces an existing nn.Linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class LoRALinear(nn.Module):
    """Replaces a standard nn.Linear layer with a LoRA-adapted version."""

    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int,
        alpha: float = 1.0,
        lora_dropout_p: float = 0.0,
    ):
        super().__init__()
        self.in_features = original_layer.in_features
        self.out_features = original_layer.out_features
        self.rank = rank
        self.alpha = alpha

        # Register original weight and bias as non-trainable parameters
        self.weight = nn.Parameter(original_layer.weight.detach().clone())
        self.weight.requires_grad = False
        if original_layer.bias is not None:
            self.bias = nn.Parameter(original_layer.bias.detach().clone())
            self.bias.requires_grad = False
        else:
            # Use register_parameter to ensure the 'bias' attribute exists, even if None
            self.register_parameter('bias', None)

        # Create the LoRA matrices A (r x d_in) and B (d_out x r)
        self.lora_A = nn.Parameter(torch.empty(rank, self.in_features))
        self.lora_B = nn.Parameter(torch.empty(self.out_features, rank))

        # Optional dropout layer for the LoRA path
        if lora_dropout_p > 0.0:
            self.lora_dropout = nn.Dropout(p=lora_dropout_p)
        else:
            self.lora_dropout = nn.Identity()  # Acts as a pass-through

        # Scaling factor alpha / r
        if rank > 0:
            self.scaling = self.alpha / self.rank
        else:
            self.scaling = 1.0  # Avoid division by zero if rank is 0

        # Initialize LoRA parameters
        self.reset_lora_parameters()

    def reset_lora_parameters(self):
        """Initialize LoRA matrices A and B."""
        if self.rank > 0:
            # Initialize A using Kaiming uniform for better gradient flow
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            # Initialize B with zeros, so the initial adaptation is zero
            nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Executes the modified forward pass."""
        # Compute the original (frozen) linear transformation.
        # Using F.linear avoids issues with device placement when mixing tensors.
        result = F.linear(x, self.weight, self.bias)

        # Compute the LoRA adjustment if rank > 0
        if self.rank > 0:
            # Apply dropout to the input before the LoRA matrices
            x_lora = self.lora_dropout(x)
            # Compute x @ A^T: input (N, d_in), weight lora_A (r, d_in) -> output (N, r)
            after_A = F.linear(x_lora, self.lora_A)
            # Compute (x @ A^T) @ B^T: input (N, r), weight lora_B (d_out, r) -> output (N, d_out)
            lora_adjustment = F.linear(after_A, self.lora_B)
            # Add the scaled LoRA adjustment to the original result
            result = result + lora_adjustment * self.scaling

        return result

    def train(self, mode: bool = True):
        """Ensure original weights remain frozen when switching training modes."""
        super().train(mode)
        # Explicitly re-assert requires_grad=False after the mode change
        self.weight.requires_grad = False
        if self.bias is not None:
            self.bias.requires_grad = False
        # LoRA parameters (lora_A, lora_B) stay trainable, which they are by default
        return self

    def extra_repr(self) -> str:
        """Adds LoRA-specific info to the module representation."""
        return (f'in_features={self.in_features}, out_features={self.out_features}, '
                f'rank={self.rank}, alpha={self.alpha}')
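To see the module in action, the minimal sketch below (continuing from the class definition above, with arbitrary dimensions and hyperparameters) wraps an existing nn.Linear layer and confirms that only the low-rank matrices are trainable:

# Wrap an existing linear layer with LoRA (illustrative sizes and settings)
base_layer = nn.Linear(4096, 4096)
lora_layer = LoRALinear(base_layer, rank=8, alpha=16.0, lora_dropout_p=0.05)

x = torch.randn(2, 4096)
y = lora_layer(x)
print(y.shape)  # torch.Size([2, 4096])

# Only lora_A and lora_B should require gradients
for name, param in lora_layer.named_parameters():
    print(name, param.requires_grad)
# weight False, bias False, lora_A True, lora_B True

trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # trainable: 65536 / total: 16846848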
Key aspects of this implementation:

- lora_A is typically initialized using a standard scheme like Kaiming uniform, while lora_B is initialized to zeros. Initializing lora_B to zero ensures that the initial $\Delta W = BA$ is zero, meaning the adapted model starts exactly equivalent to the pre-trained model before training begins.
- requires_grad is set to False for the original weight and bias parameters. It's good practice to re-assert this within the train() method override to prevent accidental unfreezing if model modes are switched improperly.
- torch.nn.functional.linear (aliased as F.linear) is common for applying linear transformations without needing the full nn.Linear module overhead within the forward pass itself.
- The scaling factor $\alpha/r$ is applied before adding the LoRA adjustment.

The computational flow within a LoRA layer can be visualized as follows:
A conceptual diagram illustrating the LoRA forward pass. The original path uses frozen weights ($W_0$, optional $b$), while the parallel LoRA path introduces trainable low-rank matrices ($A$, $B$) whose output is scaled and added to the original result.
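Because lora_B starts at zero, the adapted layer should reproduce the original layer's output exactly before any training. A small check along these lines (again using the class defined above and arbitrary sizes):

base_layer = nn.Linear(64, 32)
lora_layer = LoRALinear(base_layer, rank=4, alpha=8.0)

x = torch.randn(5, 64)
with torch.no_grad():
    # Identical outputs: lora_B is zero, so the LoRA branch contributes nothing yet
    print(torch.allclose(base_layer(x), lora_layer(x)))  # True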
While nn.Linear layers are the most common targets for LoRA in Transformers (particularly within the self-attention and feed-forward network blocks), the underlying principle of low-rank adaptation can be conceptually applied to other layer types:

- Convolutional layers (nn.Conv2d): The weight tensor in a convolutional layer also holds potential for low-rank adaptation. $\Delta W$ would be a 4D tensor, and techniques like decomposing it into a sequence of smaller convolutions or using low-rank tensor decompositions (like Tucker or CP) can be applied. Implementations exist but are less standardized than for linear layers.
- Embedding layers (nn.Embedding): An embedding layer's weight matrix is essentially a lookup table ($V \times d_{model}$, where $V$ is the vocabulary size). $\Delta W$ is a matrix of the same shape, making it directly amenable to LoRA ($BA$ where $B \in \mathbb{R}^{V \times r}$, $A \in \mathbb{R}^{r \times d_{model}}$), similar to a linear layer; a minimal sketch appears after this list.

However, in practice, applying LoRA primarily to the Linear layers, especially those in attention mechanisms (query, key, value, output projections) and feed-forward networks, often yields the best trade-off between parameter efficiency and performance for LLMs.
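For the embedding case, here is an illustrative sketch (not a standard library implementation): the class name LoRAEmbedding, the small random initialization of A, and the zero initialization of B (which keeps $\Delta W = 0$ at the start of training) are choices made here for demonstration.

class LoRAEmbedding(nn.Module):
    """Illustrative sketch: LoRA applied to an embedding lookup table."""
    def __init__(self, original_layer: nn.Embedding, rank: int, alpha: float = 1.0):
        super().__init__()
        V, d_model = original_layer.num_embeddings, original_layer.embedding_dim
        # Frozen original table (V, d_model)
        self.weight = nn.Parameter(original_layer.weight.detach().clone(),
                                   requires_grad=False)
        self.lora_B = nn.Parameter(torch.zeros(V, rank))                # B in R^{V x r}, zero-init
        self.lora_A = nn.Parameter(torch.randn(rank, d_model) * 0.01)   # A in R^{r x d_model}
        self.scaling = alpha / rank

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        base = F.embedding(input_ids, self.weight)                # look up rows of W0
        # Row i of Delta W = BA equals B[i] @ A, so look up rows of B, then project through A
        delta = F.embedding(input_ids, self.lora_B) @ self.lora_A
        return base + self.scaling * delta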
Let's quantify the parameter savings. Consider a standard nn.Linear layer with $d_{in}$ input features and $d_{out}$ output features. Full fine-tuning updates all $d_{in} \times d_{out}$ weight parameters, whereas LoRA trains only $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$, for a total of $r \times (d_{in} + d_{out})$ trainable parameters.

Example: For a layer in a large model where $d_{in} = d_{out} = 4096$ and rank $r = 8$:

- Full fine-tuning: 4096 × 4096 = 16,777,216 trainable parameters.
- LoRA: 8 × (4096 + 4096) = 65,536 trainable parameters.

This represents a ~256x reduction in trainable parameters for this single layer (16,777,216 / 65,536 = 256). When applied across multiple layers, the cumulative savings are substantial, dramatically reducing the memory footprint for optimizer states and gradients during training.
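The arithmetic above is easy to reproduce with a small helper function (a convenience defined here for illustration, not part of any library):

def lora_savings(d_in: int, d_out: int, rank: int) -> None:
    full = d_in * d_out              # trainable params under full fine-tuning
    lora = rank * (d_in + d_out)     # trainable params for A (r x d_in) and B (d_out x r)
    print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")

lora_savings(4096, 4096, 8)  # full: 16,777,216  lora: 65,536  reduction: 256x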
With this understanding of how to implement individual LoRA layers, the next step involves integrating these modified layers into complete Transformer architectures, which we will cover later in this chapter. Practical considerations like selecting the appropriate rank r and the scaling factor α are also critical for successful LoRA application and will be discussed shortly.