Low-Rank Adaptation reduces parameter counts by decomposing weight updates into two smaller matrices. To apply this effectively in practice, you must configure two primary settings for the training process. This involves determining exactly where in the neural network these smaller matrices will be attached, known as target modules, and defining the size of these matrices, governed by the rank parameter.
Target modules represent the specific layers within the model architecture that will receive the trainable adapters. Small Language Models are typically built using a stack of transformer blocks. Each block consists of a self-attention mechanism and a feed-forward neural network. The self-attention mechanism performs its operations using several distinct linear layers, usually designated as query, key, value, and output projection matrices. The feed-forward network similarly contains linear layers, often labeled as gate, up, and down projections.
Historically, memory constraints forced practitioners to apply adapters exclusively to the query and value matrices within the attention mechanism. Current hardware optimizations and training libraries now make it feasible to target all linear layers across both the attention mechanisms and the feed-forward networks. Targeting all linear layers provides the model with a higher capacity to adapt to complex instructions without severely inflating the VRAM requirements.
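As a concrete sketch, the snippet below collects the linear-layer names for both targeting strategies. The module names shown (`q_proj`, `v_proj`, and so on) follow the common Llama-style naming convention and are an assumption here; check your specific model's architecture for the actual names.

```python
# Typical linear-layer names in a Llama-style transformer block.
# These names are an assumption; they vary by model architecture.
ATTENTION_LAYERS = ["q_proj", "k_proj", "v_proj", "o_proj"]
FEED_FORWARD_LAYERS = ["gate_proj", "up_proj", "down_proj"]

# Historical, memory-constrained choice: query and value projections only.
legacy_targets = ["q_proj", "v_proj"]

# Current recommendation: adapt every linear layer in both sub-blocks.
all_linear_targets = ATTENTION_LAYERS + FEED_FORWARD_LAYERS

print(all_linear_targets)
```

Training libraries typically accept such a list of names (or a shorthand for "all linear layers") directly in their adapter configuration.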
Figure: Transformer block architecture with LoRA adapters attached to the linear projection layers of the self-attention and feed-forward modules.
Once you have identified the target modules, you must set the rank parameter, denoted as $r$. The rank dictates the inner dimension of the low-rank matrices $A$ and $B$. If the original pre-trained weight matrix $W_0$ has dimensions $d \times k$, the adapter matrix $A$ will have dimensions $r \times k$, and matrix $B$ will have dimensions $d \times r$.
This mathematical relationship directly impacts the number of parameters you will train. For a linear layer with an input dimension of $k = 4096$ and an output dimension of $d = 4096$, a standard full weight update requires $d \times k$ parameters, totaling 16,777,216. If you configure a rank of $r = 8$, the two matrices combined will contain only $r \times (d + k)$ parameters. This results in 65,536 trainable parameters, representing a significant reduction in computational overhead.
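The arithmetic above can be checked with a few lines of Python:

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense update of a d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Parameters in LoRA matrices B (d x r) and A (r x k)."""
    return d * r + r * k  # equivalently r * (d + k)

d = k = 4096  # layer dimensions from the example above
print(full_update_params(d, k))   # 16777216
print(lora_params(d, k, r=8))     # 65536
```

The adapter trains roughly 0.4% of the parameters a full update would require for this layer.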
Choosing the appropriate rank involves balancing performance with resource usage. A lower rank, such as 8 or 16, is often adequate for straightforward tasks like text classification or enforcing a specific output format. For tasks demanding complex reasoning or teaching the model entirely new syntax, a higher rank like 32, 64, or 128 is recommended. Higher ranks increase both the VRAM usage and the time required for each training step.
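Because the trainable-parameter count grows linearly with rank, a quick sweep makes the trade-off concrete. This sketch reuses the 4096-dimensional layer from the earlier example:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in a LoRA adapter of rank r."""
    return r * (d + k)

d = k = 4096
for r in (8, 16, 32, 64, 128):
    n = lora_params(d, k, r)
    pct = 100 * n / (d * k)
    print(f"r={r:3d}: {n:>9,} trainable parameters ({pct:.2f}% of full)")
```

Doubling the rank doubles the adapter's parameter count, which is why the higher ranks noticeably increase VRAM usage and step time.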
Figure: Estimated relationship between the chosen rank parameter and the total number of trainable parameters for a typical target module configuration.
Alongside rank, the configuration requires setting a scaling factor known as alpha ($\alpha$). During the forward pass, the output from the low-rank matrices is scaled by the ratio $\frac{\alpha}{r}$. The complete mathematical operation for calculating the hidden state $h$ from an input $x$ is defined as:

$$h = W_0 x + \frac{\alpha}{r} B A x$$
This scaling mechanism ensures that the magnitude of the weight updates remains consistent even if you decide to change the rank later in your experiments. A widely accepted heuristic is to set $\alpha$ to twice the value of $r$, or simply equal to $r$. If your configuration uses a rank of 16, setting $\alpha$ to 32 serves as a reliable starting baseline. If you adjust $r$ to 32, you would correspondingly adjust $\alpha$ to 64, preventing the need to completely retune your learning rate.
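A minimal pure-Python sketch of this forward pass makes the scaling explicit. The matrix values here are toy numbers chosen only to illustrate the shapes ($d = k = 2$, $r = 1$); real adapters operate on much larger tensors:

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """Compute h = W0 @ x + (alpha / r) * (B @ (A @ x))."""
    base = matvec(W0, x)                # frozen pre-trained path
    delta = matvec(B, matvec(A, x))     # low-rank adapter path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example with d = k = 2 and rank r = 1 (illustrative values only).
W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, d x k
A = [[1.0, 1.0]]                # r x k down-projection
B = [[0.5], [0.5]]              # d x r up-projection
x = [1.0, 2.0]

h = lora_forward(W0, A, B, x, alpha=2, r=1)
print(h)  # base output [1.0, 2.0] plus the scaled update [3.0, 3.0]
```

Note that with the $\alpha = 2r$ heuristic, the scale factor $\frac{\alpha}{r}$ stays fixed at 2 no matter which rank you pick, which is exactly why the learning rate does not need retuning.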
Finally, to mitigate overfitting on your specific dataset, you should configure a dropout rate for the adapter layers. LoRA dropout randomly zeroes a small fraction of the activations flowing through the adapter path during each forward pass in the training loop. A dropout value between 0.05 and 0.1 is standard. This forces the network to distribute its learning across all available adapter parameters rather than relying on a few, improving the model's ability to generalize to unseen prompts during inference.
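The behavior can be sketched in pure Python as standard inverted dropout; training frameworks apply the equivalent operation inside the adapter's forward pass:

```python
import random

def lora_dropout(values, p, rng=random):
    """Inverted dropout: zero each element with probability p and
    rescale survivors by 1 / (1 - p) so the expected sum is unchanged."""
    if p <= 0.0:
        return list(values)
    if p >= 1.0:
        return [0.0 for _ in values]
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

activations = [0.2, -0.5, 1.0, 0.3]
print(lora_dropout(activations, p=0.0))   # identity: nothing dropped
print(lora_dropout(activations, p=1.0))   # everything zeroed
```

Dropout is active only during training; at inference time it is disabled, so the full adapter contributes to every prediction.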