Low-Rank Adaptation (LoRA) rests on the hypothesis that the weight update needed during fine-tuning lies in a low-dimensional subspace, expressed mathematically as $\Delta W \approx BA$. The technique is implemented inside a neural network layer by augmenting the layer's computation with this low-rank update, computed from the two small matrices $A$ and $B$, while leaving the original weights $W_0$ untouched.
The standard forward pass for a linear (fully connected) layer is typically written as
$$y = x W_0^T + b$$
where $x$ is the input tensor, $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$ is the weight matrix, $b$ is the optional bias vector, and $y$ is the output tensor. The dimensions assume a batched input $x \in \mathbb{R}^{N \times d_{in}}$ producing an output $y \in \mathbb{R}^{N \times d_{out}}$.
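For a quick shape check, here is a small sketch (with arbitrary example dimensions) that evaluates this forward pass using torch.nn.functional.linear, which follows the same $y = x W_0^T + b$ convention:

import torch
import torch.nn.functional as F

N, d_in, d_out = 4, 16, 32       # example batch size and feature dimensions (arbitrary)
x = torch.randn(N, d_in)         # input batch
W0 = torch.randn(d_out, d_in)    # weight matrix of shape (d_out, d_in)
b = torch.randn(d_out)           # bias vector

y = F.linear(x, W0, b)           # computes x @ W0.T + b
print(y.shape)                   # torch.Size([4, 32])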
With LoRA, we keep $W_0$ and $b$ frozen. The adaptation $\Delta W = BA$ (where $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and $r$ is the rank) is added to the computation. The modified forward pass becomes
$$y = x W_0^T + x (\Delta W)^T + b$$
Substituting $\Delta W = BA$:
$$y = x W_0^T + x (BA)^T + b = x W_0^T + x A^T B^T + b$$
The LoRA paper introduces a scaling factor $\alpha / r$ applied to the low-rank update. This scaling helps manage the magnitude of the adaptation relative to the original weights, especially when changing the rank $r$. The final LoRA forward pass is
$$y = x W_0^T + \frac{\alpha}{r}\left(x A^T B^T\right) + b$$
This formulation is significant for several reasons: $W_0$ and $b$ stay frozen, so only the much smaller matrices $A$ and $B$ receive gradients and optimizer state; the update runs as a parallel branch, so the pre-trained weights are never modified in place; and after training, $\frac{\alpha}{r} BA$ can be merged into $W_0$, so inference incurs no additional latency.
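A minimal sketch of that last point, using randomly initialized tensors and an assumed rank and scaling factor, confirms numerically that adding the scaled low-rank branch to the output is equivalent to merging $\frac{\alpha}{r} BA$ into the weight matrix:

import torch

d_in, d_out, r, alpha = 16, 32, 4, 8.0   # assumed dimensions, rank, and alpha
x = torch.randn(5, d_in)
W0 = torch.randn(d_out, d_in)
A = torch.randn(r, d_in)
B = torch.randn(d_out, r)
scaling = alpha / r

# Parallel-branch form: y = x W0^T + (alpha/r) * x A^T B^T
y_branch = x @ W0.T + scaling * (x @ A.T @ B.T)

# Merged form: y = x (W0 + (alpha/r) * B A)^T
y_merged = x @ (W0 + scaling * (B @ A)).T

print(torch.allclose(y_branch, y_merged, atol=1e-5))  # True (up to float error)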
Let's sketch out how to implement a LoRALinear layer in PyTorch. This typically involves creating a custom module that wraps or replaces an existing nn.Linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class LoRALinear(nn.Module):
    """Replaces a standard nn.Linear layer with a LoRA-adapted version."""

    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int,
        alpha: float = 1.0,
        lora_dropout_p: float = 0.0,
    ):
        super().__init__()
        self.in_features = original_layer.in_features
        self.out_features = original_layer.out_features
        self.rank = rank
        self.alpha = alpha

        # Register original weight and bias as non-trainable parameters
        self.weight = nn.Parameter(original_layer.weight.detach().clone())
        self.weight.requires_grad = False
        if original_layer.bias is not None:
            self.bias = nn.Parameter(original_layer.bias.detach().clone())
            self.bias.requires_grad = False
        else:
            # Use register_parameter so the 'bias' attribute exists, even if None
            self.register_parameter('bias', None)

        # Create LoRA matrices A and B (initialized in reset_lora_parameters)
        self.lora_A = nn.Parameter(torch.empty(rank, self.in_features))
        self.lora_B = nn.Parameter(torch.empty(self.out_features, rank))

        # Optional dropout layer for the LoRA path
        if lora_dropout_p > 0.0:
            self.lora_dropout = nn.Dropout(p=lora_dropout_p)
        else:
            self.lora_dropout = nn.Identity()  # Acts as a pass-through

        # Scaling factor
        if rank > 0:
            self.scaling = self.alpha / self.rank
        else:
            self.scaling = 1.0  # Avoid division by zero if rank is 0

        # Initialize LoRA parameters
        self.reset_lora_parameters()

    def reset_lora_parameters(self):
        """Initialize LoRA matrices A and B."""
        if self.rank > 0:
            # Initialize A using Kaiming uniform for better gradient flow
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            # Initialize B with zeros, so the initial adaptation is zero
            nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Executes the modified forward pass."""
        # Compute the original (frozen) linear transformation.
        # F.linear applies the frozen weight and bias without a separate nn.Linear module.
        result = F.linear(x, self.weight, self.bias)

        # Compute the LoRA adjustment if rank > 0
        if self.rank > 0:
            # Apply dropout to the input before the LoRA matrices
            x_lora = self.lora_dropout(x)
            # Compute x @ A^T
            # Input x_lora (N, d_in), weight lora_A (r, d_in) -> output (N, r)
            after_A = F.linear(x_lora, self.lora_A)
            # Compute (x @ A^T) @ B^T
            # Input after_A (N, r), weight lora_B (d_out, r) -> output (N, d_out)
            lora_adjustment = F.linear(after_A, self.lora_B)
            # Add the scaled LoRA adjustment to the original result
            result = result + lora_adjustment * self.scaling

        return result

    def train(self, mode: bool = True):
        """Ensure original weights remain frozen when switching modes."""
        super().train(mode)
        # Explicitly re-freeze the original parameters after the mode change
        self.weight.requires_grad = False
        if self.bias is not None:
            self.bias.requires_grad = False
        # LoRA parameters stay trainable (requires_grad=True by default)
        return self

    def extra_repr(self) -> str:
        """Adds LoRA-specific info to the module representation."""
        return (f'in_features={self.in_features}, out_features={self.out_features}, '
                f'rank={self.rank}, alpha={self.alpha}')
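As a usage sketch, assuming the class and imports above (the 4096-dimensional layer and the hyperparameter values are illustrative choices, not prescriptions), an existing nn.Linear can be wrapped and the trainable-parameter count inspected:

# Hypothetical example: adapt a single 4096 x 4096 projection layer
base_layer = nn.Linear(4096, 4096)
lora_layer = LoRALinear(base_layer, rank=8, alpha=16.0, lora_dropout_p=0.05)

trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
# trainable: 65,536 / total: 16,846,848 (frozen weight + bias, plus A and B)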
Important aspects of this implementation:
- Initialization: lora_A is typically initialized with a standard scheme such as Kaiming uniform, while lora_B is initialized to zeros. Starting lora_B at zero ensures the initial $\Delta W = BA$ is zero, so the adapted model is exactly equivalent to the pre-trained model before training begins (a quick check of this appears after the diagram below).
- Freezing: requires_grad is set to False for the original weight and bias parameters. It's good practice to re-assert this in the train() method override to prevent accidental unfreezing if model modes are switched improperly.
- Functional calls: torch.nn.functional.linear (aliased as F.linear) is a common way to apply linear transformations inside the forward pass without the overhead of a full nn.Linear module.
- Scaling: the factor $\alpha / r$ is applied before the LoRA adjustment is added to the original output.

The computational flow within a LoRA layer can be visualized as follows:
A diagram illustrating the LoRA forward pass. The original path uses frozen weights (W0, optional b), while the parallel LoRA path introduces trainable low-rank matrices (A, B) whose output is scaled and added to the original result.
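To verify the zero-initialization behavior noted above, a quick check with an arbitrary small layer shows that a freshly wrapped LoRALinear reproduces the original layer's output exactly, because lora_B starts at zero:

base = nn.Linear(64, 128)
adapted = LoRALinear(base, rank=4, alpha=8.0)

x = torch.randn(2, 64)
with torch.no_grad():
    print(torch.allclose(base(x), adapted(x)))  # True: B = 0, so the LoRA branch adds nothing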
While nn.Linear layers are the most common targets for LoRA in Transformers (particularly within the self-attention and feed-forward network blocks), the underlying principle of low-rank adaptation can be applied to other layer types:
- Convolutional layers (nn.Conv2d): The weight tensor in a convolutional layer also holds potential for low-rank adaptation. Here $\Delta W$ would be a 4D tensor, and techniques such as decomposing it into a sequence of smaller convolutions or applying low-rank tensor decompositions (like Tucker or CP) can be used. Implementations exist but are less standardized than for linear layers.
- Embedding layers (nn.Embedding): An embedding layer's weight matrix is essentially a lookup table of shape $V \times d_{model}$, where $V$ is the vocabulary size. $\Delta W$ is a matrix of the same shape, making it directly amenable to LoRA ($BA$ with $B \in \mathbb{R}^{V \times r}$, $A \in \mathbb{R}^{r \times d_{model}}$), much like a linear layer; a minimal sketch appears after this list.

In practice, however, applying LoRA primarily to the Linear layers, especially the attention projections (query, key, value, output) and the feed-forward networks, often yields the best trade-off between parameter efficiency and performance for LLMs.
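To illustrate the embedding case, here is a minimal, hypothetical LoRAEmbedding sketch that mirrors the LoRALinear pattern above (reusing the earlier imports); it is meant only to show the shape of the idea, not a production implementation:

class LoRAEmbedding(nn.Module):
    """Minimal sketch of LoRA applied to an embedding lookup table (illustrative only)."""

    def __init__(self, original_layer: nn.Embedding, rank: int, alpha: float = 1.0):
        super().__init__()
        # Frozen V x d_model lookup table
        self.weight = nn.Parameter(original_layer.weight.detach().clone())
        self.weight.requires_grad = False
        # Delta W = B @ A with B in R^{V x r}, A in R^{r x d_model};
        # B starts at zero so the initial adaptation is zero (assumes rank > 0)
        self.lora_B = nn.Parameter(torch.zeros(original_layer.num_embeddings, rank))
        self.lora_A = nn.Parameter(torch.randn(rank, original_layer.embedding_dim) * 0.01)
        self.scaling = alpha / rank

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        base = F.embedding(ids, self.weight)                 # rows of W0
        delta = F.embedding(ids, self.lora_B) @ self.lora_A  # rows of (B @ A)
        return base + delta * self.scaling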
Let's quantify the parameter savings. Consider a standard nn.Linear layer with $d_{in}$ input features and $d_{out}$ output features. Its weight matrix alone holds $d_{in} \times d_{out}$ parameters, all of which would be trainable under full fine-tuning. With LoRA at rank $r$, the trainable parameters are only those of $A$ and $B$: $r \times d_{in} + d_{out} \times r = r(d_{in} + d_{out})$.
Example: For a layer in a large model where $d_{in} = d_{out} = 4096$ and a LoRA rank of $r = 8$, the frozen weight contains $4096 \times 4096 = 16{,}777{,}216$ parameters, while LoRA trains only $8 \times (4096 + 4096) = 65{,}536$.
This represents a 256x reduction in trainable parameters for this single layer ($16{,}777{,}216 / 65{,}536 = 256$). Applied across many layers, the cumulative savings are substantial, dramatically reducing the memory footprint for optimizer states and gradients during training.
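The arithmetic is easy to reproduce directly; the dimensions and rank below match the example above:

d_in = d_out = 4096
r = 8

full_params = d_in * d_out          # trainable parameters under full fine-tuning
lora_params = r * (d_in + d_out)    # trainable parameters under LoRA

print(full_params)                  # 16777216
print(lora_params)                  # 65536
print(full_params / lora_params)    # 256.0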
With this understanding of how to implement individual LoRA layers, the next step involves integrating these modified layers into complete Transformer architectures, which we will cover later in this chapter. Practical considerations like selecting the appropriate rank $r$ and the scaling factor $\alpha$ are also critical for successful LoRA application and will be discussed shortly.