To understand parameter reduction, it helps to look at how a language model learns mathematically. During standard training, updating a dense layer requires learning a full matrix of weight changes, denoted as ΔW. If a pre-trained weight matrix W₀ has dimensions d × k, the update matrix ΔW must also have dimensions d × k. For modern language models with hidden sizes reaching into the thousands, a single weight matrix can contain tens of millions of parameters.
Researchers observed that over-parameterized neural networks possess a low intrinsic dimension. This means that while a model requires billions of parameters to learn general language representation during pre-training, it does not require that entire parameter space to adapt to a specific downstream task. The required weight updates can be accurately represented in a much lower-dimensional space. Low-Rank Adaptation applies this principle to reduce the computational burden of training.
Instead of calculating and storing the massive ΔW matrix, LoRA freezes the original matrix W₀ and approximates the update using matrix factorization. The update matrix is decomposed into two smaller matrices, B and A:

ΔW = BA
In this equation, ΔW has the full d × k dimensions. We define a rank parameter r, where r ≪ min(d, k). The decomposition creates matrix B with dimensions d × r and matrix A with dimensions r × k. When you multiply B and A, the resulting matrix matches the original d × k dimensions, making it perfectly compatible for element-wise addition with the original weight matrix W₀.
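As a quick sanity check, the shape arithmetic can be verified in a few lines of NumPy (the dimensions here match the worked example below and are illustrative):

```python
import numpy as np

d, k, r = 4096, 4096, 8           # layer dimensions and rank

B = np.random.randn(d, r)         # d x r
A = np.random.randn(r, k)         # r x k
delta_W = B @ A                   # the product recovers the full d x k shape

print(delta_W.shape)              # (4096, 4096)
```

Because `delta_W` has the same shape as the frozen weight matrix, it can be added directly to W₀ after training if you want to merge the adapter into the base weights.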
The mathematical efficiency of this approach becomes obvious when you calculate the trainable parameter count. Assume a linear layer in a transformer has an input dimension d = 4096 and an output dimension k = 4096.
Standard fine-tuning requires updating the full d × k matrix: 4096 × 4096 = 16,777,216 trainable parameters.
If you apply LoRA with a rank r = 8, you only train matrices B and A. Matrix B has dimensions 4096 × 8, and matrix A has dimensions 8 × 4096, for a total of (4096 × 8) + (8 × 4096) = 65,536 trainable parameters.
Trainable parameter comparison between standard weight updates and a low-rank adapter configuration.
By updating just 65,536 parameters instead of nearly 16.8 million, you reduce the trainable parameters for that layer by over 99.6%. Because modern optimizers like Adam store running averages of the gradients and squared gradients for every single trainable parameter, this massive reduction in parameters corresponds directly to a massive reduction in required GPU VRAM.
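The parameter arithmetic above is simple enough to check directly:

```python
d, k = 4096, 4096                 # layer input and output dimensions
r = 8                             # LoRA rank

full = d * k                      # standard fine-tuning: full update matrix
lora = (d * r) + (r * k)          # LoRA: B (d x r) plus A (r x k)

print(full)                       # 16777216
print(lora)                       # 65536
print(f"{100 * (1 - lora / full):.2f}% fewer trainable parameters")  # 99.61%
```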
During the forward pass of training, the model processes the input vector x through both the frozen weights and the trainable adapter matrices simultaneously. The operation is expressed as:

h = W₀x + ΔWx = W₀x + BAx
Data flow in a transformer layer using Low-Rank Adaptation matrices.
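A minimal NumPy sketch of this forward pass (small illustrative dimensions, random weights) shows both paths contributing to the output:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4

W0 = rng.standard_normal((d, k))   # frozen pre-trained weights
B = rng.standard_normal((d, r))    # trainable adapter matrix
A = rng.standard_normal((r, k))    # trainable adapter matrix

x = rng.standard_normal(k)

# Frozen path and adapter path are computed in parallel and summed.
h = W0 @ x + B @ (A @ x)
```

Note that computing `B @ (A @ x)` projects x down to r dimensions and back up, so the full d × k matrix BA is never materialized during the forward pass.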
The initialization of these matrices is heavily engineered to ensure training stability. Matrix A is initialized with a random Gaussian distribution. Matrix B is initialized with zeros. Because matrix B starts entirely as zeros, the product BA evaluates to exactly zero at the beginning of training. This guarantees that h = W₀x on the first training step, meaning the network behaves exactly like the unmodified base model until the first weight updates occur.
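This zero-start guarantee is easy to demonstrate: with B initialized to zeros, the adapted output is identical to the base model's output before any updates occur. A short NumPy check (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4

W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k)) * 0.01   # Gaussian initialization
B = np.zeros((d, r))                     # zero initialization

x = rng.standard_normal(k)

# Before any training steps, the adapter path contributes exactly zero.
assert np.allclose(W0 @ x + B @ (A @ x), W0 @ x)
```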
LoRA also introduces a scaling factor α (alpha) to manage the magnitude of the weight updates. The product of B and A is scaled by the ratio of α to r before being added to the base weights:

h = W₀x + (α/r)BAx
This scaling mechanism ensures that changing the rank r during hyperparameter tuning does not force you to drastically retune the learning rate. If you increase the rank to capture more complex patterns in your custom dataset, the ratio α/r normalizes the initial gradients, keeping the learning process mathematically stable across different configurations.
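Putting the pieces together, the full adapted layer can be sketched as a small class (the class name and initialization scale are illustrative, not a specific library's API):

```python
import numpy as np

class LoRALinear:
    """Toy LoRA-adapted linear layer: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, W0, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d, k = W0.shape
        self.W0 = W0                                  # frozen base weights
        self.A = rng.standard_normal((r, k)) * 0.01   # Gaussian init
        self.B = np.zeros((d, r))                     # zero init
        self.scale = alpha / r                        # fixed for a given config

    def __call__(self, x):
        # Adapter output is scaled by alpha / r before being added.
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))
```

Doubling r while holding α constant halves `self.scale`, which is exactly how the ratio compensates for the larger adapter when you experiment with different ranks.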