Learning rates dictate the magnitude of the weight updates applied during backpropagation. Fine-tuning a Small Language Model with a parameter-efficient technique like LoRA requires a different learning rate strategy than training a model from scratch. Because the optimization process only updates a small subset of newly initialized adapter weights, higher learning rates are typically used. Common starting values for LoRA adapters range between $10^{-4}$ and $3 \times 10^{-4}$, whereas full fine-tuning of all model weights might require much lower rates, closer to $10^{-5}$.
Applying a constant learning rate throughout the entire training loop is rarely effective. Early in the fine-tuning process, the adapter weights are randomly initialized, which produces a high initial loss and correspondingly large gradients. If the optimizer applies the maximum learning rate to these large gradients immediately, training can become numerically unstable and the model parameters can diverge. To manage this behavior, the training loop relies on a learning rate scheduler.
A scheduler dynamically adjusts the step size of the optimizer based on the current training step. The standard approach begins with a warmup phase. During this phase, the learning rate increases linearly from exactly zero to the maximum configured learning rate over a specific number of steps. This gradual scaling allows the model to stabilize its initial weights before taking larger optimization steps. You can configure this warmup period using a fixed number of steps or as a percentage of the total training duration.
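The linear warmup described above can be sketched as a small helper function. The function name and values here are illustrative, not part of any particular library:

```python
def warmup_lr(step: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup: scale the peak learning rate by training progress.

    Returns 0 at step 0 and reaches peak_lr once warmup completes.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# A 100-step warmup toward a peak learning rate of 2e-4
print(warmup_lr(0, 100, 2e-4))    # start of training: 0.0
print(warmup_lr(50, 100, 2e-4))   # halfway through warmup: 1e-4
print(warmup_lr(100, 100, 2e-4))  # warmup complete: 2e-4
```

Expressing the warmup as a fraction of the peak rate is what makes the percentage-based configuration mentioned above straightforward: the number of warmup steps is simply derived from the total step count.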
After reaching the peak learning rate at the end of the warmup phase, the learning rate must decrease. As the model approaches a minimum in the loss function, taking large steps can cause the optimization process to bounce around or completely overshoot the minimum value. A decay strategy gradually reduces the step size to help the model settle into a final configuration.
Recall the exponential decay formula introduced earlier:

$$\eta_t = \eta_0 \cdot e^{-kt}$$

where $\eta_0$ is the initial learning rate, $k$ is the decay rate, and $t$ is the training step.
While this mathematical foundation is common in classic machine learning, modern language model training frequently relies on a cosine decay schedule. A cosine scheduler follows the shape of a cosine curve. It drops slowly at first, decreases almost linearly through the middle of training, and finally tapers off very slowly near zero. This curve provides a long period of productive training followed by a gentle settling phase.
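Combining the warmup and decay phases gives the complete schedule. The following is a minimal sketch of a cosine schedule with linear warmup; the function name is hypothetical, and production code would normally rely on a library implementation rather than this hand-rolled version:

```python
import math

def cosine_schedule_with_warmup(step: int, total_steps: int,
                                warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        # Warmup phase: learning rate rises linearly with progress.
        return peak_lr * step / warmup_steps
    # Decay phase: map remaining steps onto half a cosine period.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 1000 total steps, 100-step warmup, peak of 2e-4
print(cosine_schedule_with_warmup(100, 1000, 100, 2e-4))   # peak: 2e-4
print(cosine_schedule_with_warmup(550, 1000, 100, 2e-4))   # mid-decay: 1e-4
print(cosine_schedule_with_warmup(1000, 1000, 100, 2e-4))  # end: 0.0
```

Note the curve's shape in the decay branch: near the peak and near zero the cosine is flat, so the rate changes slowly, while through the middle of training it falls almost linearly.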
Learning rate progression over 1000 training steps featuring a 100-step linear warmup followed by a cosine decay to zero.
When configuring the training loop using the Hugging Face Transformers library, these scheduling parameters are passed directly into the training arguments configuration. You specify the peak learning rate, define the scheduler type as cosine, and set a warmup ratio. A warmup ratio of 0.03 means the linear increase will consume the first three percent of your total training steps. Defining the schedule as a ratio rather than a fixed number of steps ensures your warmup phase scales automatically if you add more data or increase the number of epochs.
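A minimal configuration sketch using the Transformers `TrainingArguments` class is shown below. The output directory, epoch count, and batch size are illustrative placeholders; only the three scheduling parameters reflect the discussion above:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="slm-lora-out",        # hypothetical output path
    learning_rate=2e-4,               # peak learning rate for the LoRA adapters
    lr_scheduler_type="cosine",       # cosine decay after warmup
    warmup_ratio=0.03,                # first 3% of total steps warm up linearly
    num_train_epochs=3,               # illustrative value
    per_device_train_batch_size=8,    # illustrative value
)
```

Because `warmup_ratio` is a fraction rather than a step count, doubling the dataset or the epoch count automatically lengthens the warmup phase in proportion.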
Configuring these values correctly requires observation. If your learning rate is too high or your warmup is too short, your training loss may spike to NaN and halt the process. If the learning rate is too low or decays too quickly, the loss curve will flatten out early, and the model will fail to adapt to the new instruction format. Adjusting these schedules based on your training logs is a standard part of the fine-tuning iteration cycle.