As we move into more sophisticated applications of LoRA, understanding how to initialize the trainable parameters becomes significant. The way we set the initial values for the low-rank matrices A and B can influence training stability, convergence speed, and ultimately, the quality of the fine-tuned model. This section examines common initialization strategies and their implications.
Recall that the LoRA update $\Delta W$ for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is calculated as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The modified weight matrix during fine-tuning is $W = W_0 + \frac{\alpha}{r} BA$, although for simplicity in discussing initialization we often focus on the core $BA$ term, keeping in mind that the scaling factor $\alpha/r$ applies later. The goal of initialization is to set the starting values for $A$ and $B$ appropriately.
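To make the discussion concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. The class, its parameter names (`lora_A`, `lora_B`), and the placeholder zero values are illustrative assumptions rather than the implementation of any particular library; how those placeholders should actually be set is the subject of this section.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer (not a library implementation)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # W0 (and any bias) stays frozen
        d, k = base.out_features, base.in_features
        # Low-rank factors: B in R^{d x r}, A in R^{r x k}
        self.lora_A = nn.Parameter(torch.zeros(r, k))  # placeholder values; the choice of
        self.lora_B = nn.Parameter(torch.zeros(d, r))  # initialization is discussed below
        self.scaling = alpha / r                       # the alpha/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

A layer such as `LoRALinear(nn.Linear(64, 32), r=4, alpha=8.0)` would then train only `lora_A` and `lora_B` while leaving the base weights untouched.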
The most widely adopted, and often default, strategy for LoRA initialization is to initialize $A$ with small random values (typically Kaiming uniform or Gaussian) and to initialize $B$ with zeros.
The core idea behind this approach is to ensure that at the very beginning of the fine-tuning process (step $t=0$), the adaptation term $\Delta W = BA$ is exactly zero:

$$W_{t=0} = W_0 + BA = W_0 + 0 \cdot A = W_0$$

This means the model with LoRA layers initially behaves identically to the pre-trained base model. The fine-tuning process then gradually learns non-zero values for $B$, allowing the adaptation $\Delta W$ to emerge and adjust the model's behavior based on the task-specific data.
Advantages: Because $BA = 0$ at initialization, the adapted model reproduces the pre-trained model exactly, so training starts from a stable, well-understood point and early updates cannot immediately degrade the base model's behavior. It also works well without careful tuning of an initialization scale.

Disadvantages: Since the adaptation starts at exactly zero, the model must build up $\Delta W$ entirely from the gradient signal, which can make the earliest steps of adaptation somewhat slower than a carefully scaled non-zero start.
This strategy is implemented by default in popular libraries like Hugging Face's PEFT (peft). For instance, the LoraLayer often initializes lora_B weights to zeros and lora_A weights using Kaiming uniform initialization.
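The snippet below is a rough sketch of this default-style scheme; it mirrors the behavior described above rather than reproducing PEFT's actual source, and the dimensions are made up for illustration. Random $A$ with zero $B$ guarantees the adapted layer is indistinguishable from the frozen base layer at step zero.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d, k, r = 32, 64, 8                       # output dim, input dim, LoRA rank (illustrative)
base = nn.Linear(k, d, bias=False)
base.weight.requires_grad_(False)         # W0 is frozen

lora_A = nn.Parameter(torch.empty(r, k))
lora_B = nn.Parameter(torch.empty(d, r))

# Default-style scheme: random A, zero B  =>  BA = 0 at step t = 0
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
nn.init.zeros_(lora_B)

x = torch.randn(4, k)
adapted = base(x) + x @ lora_A.T @ lora_B.T   # scaling omitted, since BA = 0 anyway
print(torch.allclose(adapted, base(x)))       # True: identical to the pre-trained layer
```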
An alternative approach is to initialize both matrices $A$ and $B$ using a random distribution, typically Gaussian ($\mathcal{N}(0, \sigma^2)$) with a carefully chosen (usually small) variance $\sigma^2$.
In this case, at $t=0$, the adaptation term $\Delta W = BA$ will be a non-zero matrix, albeit likely one with small entries if $\sigma$ is small.
$$W_{t=0} = W_0 + BA \neq W_0 \qquad \text{(unless } BA \text{ happens to be zero)}$$

Rationale:
The motivation here is that starting with a small, non-zero random adaptation might allow the model to start learning the required adjustments more quickly, potentially accelerating convergence. The initial random $\Delta W$ provides an immediate, albeit noisy, direction for adaptation; a short sketch of this initialization follows the points below.
Considerations: The choice of $\sigma$ matters. If it is too large, the initial $BA$ perturbs the pre-trained weights enough to destabilize early training; if it is very small, the behavior is nearly indistinguishable from the zero-initialization scheme.

Advantages: The non-zero starting point gives the optimizer an immediate, if noisy, adaptation direction, which may speed up early convergence on some tasks.

Disadvantages: The model no longer matches the pre-trained base model at step zero, and the outcome is more sensitive to the choice of $\sigma$ and its interaction with $\alpha$ and $r$, requiring additional tuning.
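The sketch below shows what this alternative might look like; the standard deviation of 0.01 and the dimensions are arbitrary illustrative choices, not recommended values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, k, r = 32, 64, 8
sigma = 0.01                              # illustrative choice; would need tuning in practice

lora_A = nn.Parameter(torch.empty(r, k))
lora_B = nn.Parameter(torch.empty(d, r))

# Gaussian initialization for both factors: Delta W = BA is small but non-zero
nn.init.normal_(lora_A, mean=0.0, std=sigma)
nn.init.normal_(lora_B, mean=0.0, std=sigma)

with torch.no_grad():
    delta_W = lora_B @ lora_A             # (d x k) initial adaptation
print(delta_W.abs().max().item())         # tiny, but not exactly zero
```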
This chart conceptually illustrates how the magnitude of the LoRA update term BA might evolve. Zero-initialization for B starts the update at zero, while Gaussian initialization for both A and B starts with a small, non-zero update.
For most practical applications, starting with the default strategy (zero-initialize B, random-initialize A) is recommended. It provides a stable and reliable baseline that preserves the integrity of the pre-trained model at the onset of fine-tuning. This approach is less sensitive to hyperparameter choices compared to initializing both matrices randomly.
Consider experimenting with random initialization for both $A$ and $B$ only if the default scheme converges noticeably slowly on your task, and you are willing to tune the initialization scale $\sigma$ (together with $\alpha$ and $r$) and to monitor early training for instability.
Remember that the effective initial update magnitude is also influenced by the LoRA rank $r$ and the scaling factor $\alpha$. The update is $W_0 + \frac{\alpha}{r} BA$. Even with Gaussian initialization for $A$ and $B$, a very small $\alpha$ or a very large $r$ will diminish the initial perturbation caused by $BA$. Conversely, a large $\alpha$ or small $r$ will amplify it. These factors interact, and tuning them together is part of optimizing LoRA performance.
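A quick way to see this interaction is to measure the Frobenius norm of the scaled initial update $\frac{\alpha}{r} BA$ under Gaussian initialization for a few $(\alpha, r)$ settings; all of the numbers below are arbitrary and only meant to illustrate the trend.

```python
import torch

torch.manual_seed(0)

d, k, sigma = 32, 64, 0.01                # illustrative dimensions and init scale

def initial_update_norm(alpha: float, r: int) -> float:
    """Frobenius norm of (alpha/r) * B A with Gaussian-initialized factors."""
    A = torch.randn(r, k) * sigma
    B = torch.randn(d, r) * sigma
    return ((alpha / r) * B @ A).norm().item()

for alpha, r in [(8.0, 4), (8.0, 64), (64.0, 4)]:
    print(f"alpha={alpha:>4}, r={r:>2}: ||(alpha/r) BA||_F = {initial_update_norm(alpha, r):.2e}")
```

With the $\alpha/r$ scaling, increasing $r$ at fixed $\alpha$ shrinks the scaled initial perturbation, while increasing $\alpha$ amplifies it, matching the interaction described above.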
While other initialization schemes could be devised, perhaps drawing inspiration from matrix factorization techniques or incorporating prior knowledge about the task, Kaiming or Gaussian initialization for $A$ combined with zeros for $B$ remains the prevalent and robust starting point for applying LoRA effectively.