Understanding how to initialize the trainable parameters is important for sophisticated applications of LoRA. The way initial values are set for the low-rank matrices $A$ and $B$ can influence training stability, convergence speed, and ultimately the quality of the fine-tuned model. This section examines common initialization strategies and their implications.
Recall that the LoRA update for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is calculated as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The modified weight matrix during fine-tuning is $W = W_0 + \frac{\alpha}{r} BA$, although for simplicity in discussing initialization we often focus on the core term $BA$, keeping in mind that the scaling factor $\frac{\alpha}{r}$ applies later. The goal of initialization is to set the starting values for $A$ and $B$ appropriately.
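To make the shapes concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer using the default initialization discussed below. The class `LoRALinear` is an illustrative assumption rather than a particular library's implementation; it simply realizes $W_0 x + \frac{\alpha}{r} BAx$ with a frozen base layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = x W_0^T + (alpha / r) * x (BA)^T, with W_0 frozen."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():              # W_0 (and its bias) stay frozen
            p.requires_grad_(False)
        d_out, d_in = base_linear.weight.shape
        self.r, self.alpha = r, alpha
        # Low-rank factors: B is (d_out x r), A is (r x d_in), so B @ A matches W_0's shape.
        self.lora_A = nn.Parameter(torch.empty(r, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))   # default: B = 0, so BA = 0 at step 0
        nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)   # default-style random init for A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = (self.alpha / self.r) * (self.lora_B @ self.lora_A)
        return self.base(x) + F.linear(x, delta_w)
```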
The most widely adopted and often default strategy for LoRA initialization is:
- Matrix $A$: initialized randomly, typically with a Kaiming uniform or small Gaussian distribution.
- Matrix $B$: initialized to all zeros.
The core idea behind this approach is to ensure that at the very beginning of the fine-tuning process (step $t=0$), the adaptation term $\Delta W = BA$ is exactly zero.
This means the model with LoRA layers initially behaves identically to the pre-trained base model. The fine-tuning process then gradually learns non-zero values for $B$, allowing the adaptation $\Delta W = BA$ to emerge and adjust the model's behavior based on the task-specific data.
Advantages:
- Training starts exactly from the pre-trained model's behavior, so there is no initial performance degradation.
- The starting point is stable and predictable, with no extra hyperparameters to tune for the initialization itself.
Disadvantages:
- Because the adaptation must grow from exactly zero, the earliest updates can be small, which may slow initial progress in some settings.
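As a quick sanity check of this step-0 equivalence, the hypothetical `LoRALinear` sketch from earlier can be compared against its frozen base layer before any training has happened:

```python
# Continuing from the LoRALinear sketch above: with B = 0, the LoRA branch
# contributes nothing, so the wrapped layer reproduces the base layer's output.
base = nn.Linear(768, 768)
lora = LoRALinear(base, r=8, alpha=16.0)

x = torch.randn(2, 768)
assert torch.allclose(lora(x), base(x))
```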
This strategy is implemented by default in popular libraries like Hugging Face's PEFT (peft). For instance, the LoraLayer often initializes lora_B weights to zeros and lora_A weights using Kaiming uniform initialization.
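A sketch of how this looks with `peft` is shown below; the model name is only an example, and exact defaults may differ between library versions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # example model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights=True,   # default scheme: random A, zero B
)
model = get_peft_model(base_model, config)

# Every lora_B tensor starts as zeros, so the adapters are a no-op before training.
for name, param in model.named_parameters():
    if "lora_B" in name:
        assert param.abs().sum().item() == 0.0
```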
An alternative approach is to initialize both matrices $A$ and $B$ using a random distribution, typically Gaussian ($\mathcal{N}(0, \sigma^2)$) with a carefully chosen (usually small) variance $\sigma^2$.
In this case, at step $t=0$, the adaptation term $\Delta W = BA$ will be a non-zero matrix, albeit likely one with small entries if $\sigma^2$ is small.
Rationale:
The motivation here is that starting with a small, non-zero random adaptation might allow the model to start learning the required adjustments more quickly, potentially accelerating convergence. The initial random $BA$ provides an immediate, albeit noisy, direction for adaptation.
Considerations:
- The variance $\sigma^2$ must be chosen carefully: too large a value perturbs the pre-trained model's behavior before any task-specific learning has occurred, while a very small value behaves almost like the default zero initialization.
Advantages:
- A small non-zero starting point gives the optimizer an immediate direction to refine, which can accelerate early convergence in some cases.
Disadvantages:
- The model's initial outputs no longer match the pre-trained base model exactly.
- Results are more sensitive to the choice of initialization variance, adding another hyperparameter to tune.
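If you want to try this alternative, one straightforward way is to overwrite both factors after the layers are created. The helper below is hypothetical and assumes direct access to the `lora_A` and `lora_B` parameters, as in the earlier `LoRALinear` sketch:

```python
def gaussian_init_both(layer: LoRALinear, std: float = 1e-3) -> None:
    """Re-initialize both low-rank factors from N(0, std^2), so BA is non-zero at step 0."""
    nn.init.normal_(layer.lora_A, mean=0.0, std=std)
    nn.init.normal_(layer.lora_B, mean=0.0, std=std)

lora = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
gaussian_init_both(lora, std=1e-3)   # the small std keeps the initial perturbation modest
```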
This chart illustrates how the magnitude of the LoRA update term $\|\Delta W\|$ might evolve during training. Zero-initialization for $B$ starts the update at zero, while Gaussian initialization for both $A$ and $B$ starts with a small, non-zero update.
For most practical applications, starting with the default strategy (zero-initialize $B$, random-initialize $A$) is recommended. It provides a stable and reliable baseline that preserves the integrity of the pre-trained model at the onset of fine-tuning. This approach is also less sensitive to hyperparameter choices than initializing both matrices randomly.
Consider experimenting with random initialization for both $A$ and $B$ only if:
- the default strategy converges noticeably slowly on your task, and
- you have the budget to tune the initialization variance $\sigma^2$ carefully alongside the other LoRA hyperparameters.
Remember that the effective initial update magnitude is also influenced by the LoRA rank $r$ and the scaling factor $\alpha$. The applied update is $\frac{\alpha}{r} BA$. Even with Gaussian initialization for $A$ and $B$, a very small $\alpha$ or a very large $r$ will diminish the initial perturbation caused by $BA$. Conversely, a large $\alpha$ or a small $r$ will amplify it. These factors interact, and tuning them together is part of optimizing LoRA performance.
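A small numerical sketch, with assumed layer dimensions and an assumed standard deviation, illustrates how $\alpha$ and $r$ scale the size of the initial perturbation when both factors are Gaussian:

```python
import torch

d_out, d_in, std = 768, 768, 1e-3   # assumed layer size and initialization std

for r, alpha in [(4, 16.0), (8, 16.0), (64, 16.0), (8, 64.0)]:
    A = torch.randn(r, d_in) * std
    B = torch.randn(d_out, r) * std
    delta_w = (alpha / r) * (B @ A)   # scaled initial update
    print(f"r={r:<3d} alpha={alpha:<5.1f} ||dW||_F = {delta_w.norm().item():.2e}")
```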
While other initialization schemes could be devised, perhaps drawing inspiration from matrix factorization techniques or incorporating prior knowledge about the task, Kaiming/Gaussian initialization for $A$ and zeros for $B$ remains the prevalent and effective starting point for applying LoRA.