While adjusting the learning rate during training using schedules like step decay or exponential decay helps fine-tune convergence later on, the very beginning of training presents its own unique challenge. When a network's weights are freshly initialized, often randomly, the initial gradients calculated can be quite large or point in directions that are not immediately helpful for stable learning. Applying a relatively large initial learning rate right from the first step can sometimes lead to instability, causing the loss to oscillate wildly or even diverge. This is particularly true for complex models or when using certain adaptive optimizers whose internal states need time to stabilize.
Learning rate warmup addresses this initial instability. The core idea is straightforward: instead of starting with your target initial learning rate (say, $\alpha = 0.001$), you begin with a much smaller learning rate and gradually increase it over a predefined number of initial training steps or epochs. Once this "warmup phase" is complete, the learning rate reaches its target initial value, and then a standard learning rate schedule (like step decay, exponential decay, or cosine annealing) can take over for the remainder of the training process.
The primary benefit of warmup is training stabilization. By using small learning rates initially, you allow the model parameters to adjust gently, preventing large, potentially disruptive updates that could occur if the initial gradients are erratic. It gives the optimization process time to "settle down" before taking larger steps.
This technique is often found to be particularly helpful in specific scenarios:

- Training large or complex architectures, such as Transformers, where early updates can be especially disruptive.
- Training with large batch sizes, which typically call for larger learning rates that are risky to apply from the very first step.
- Using adaptive optimizers like Adam, whose internal moment estimates need a number of steps to become reliable.
The most common approach is linear warmup. In this strategy, the learning rate $\alpha_t$ at step $t$ increases linearly from a small starting value $\alpha_{start}$ (sometimes even 0) to the target initial learning rate $\alpha_{target}$ over $N_{warmup}$ steps:

$$\alpha_t = \alpha_{start} + (\alpha_{target} - \alpha_{start}) \times \frac{t}{N_{warmup}} \quad \text{for } 0 \le t < N_{warmup}$$

After $N_{warmup}$ steps, the learning rate is $\alpha_{target}$, and a subsequent decay schedule usually begins.
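As a concrete sketch, the linear warmup rule can be written as a small Python function (the function name and example values here are illustrative):

```python
def linear_warmup_lr(step, target_lr, warmup_steps, start_lr=0.0):
    """Linearly interpolate from start_lr to target_lr over warmup_steps."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# With start_lr=0, the rate grows in equal increments each step
print(linear_warmup_lr(0, 0.001, 100))    # 0.0
print(linear_warmup_lr(50, 0.001, 100))   # 0.0005
print(linear_warmup_lr(100, 0.001, 100))  # 0.001
```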
Another simpler, though less common, variant is constant warmup, where a small constant learning rate is used for the first $N_{warmup}$ steps, after which the learning rate abruptly jumps to $\alpha_{target}$.
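A minimal sketch of constant warmup, for comparison (names and values illustrative):

```python
def constant_warmup_lr(step, target_lr, warmup_steps, warmup_lr=1e-5):
    """Hold a small constant rate during warmup, then jump to the target."""
    return warmup_lr if step < warmup_steps else target_lr

print(constant_warmup_lr(10, 0.001, 100))   # 1e-05 (still warming up)
print(constant_warmup_lr(100, 0.001, 100))  # 0.001 (abrupt jump)
```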
The duration of the warmup phase, $N_{warmup}$, is a hyperparameter. It might be specified as a number of training steps (e.g., 1000 steps) or a number of epochs (e.g., 5 epochs). The optimal duration depends on the dataset, model, batch size, and optimizer being used.
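When the warmup duration is specified in epochs, it is usually converted to steps using the number of batches per epoch. The dataset and batch sizes below are illustrative:

```python
import math

num_samples = 50_000   # dataset size (illustrative)
batch_size = 128
warmup_epochs = 5

# One step per batch; the last partial batch still counts as a step
steps_per_epoch = math.ceil(num_samples / batch_size)
warmup_steps = warmup_epochs * steps_per_epoch
print(warmup_steps)  # 1955
```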
Figure: A learning rate schedule combining linear warmup for the first 10 steps (from $10^{-5}$ to $10^{-3}$) followed by a step decay at step 30 (to $10^{-4}$). Note the logarithmic scale on the y-axis.
Many deep learning frameworks provide mechanisms to implement warmup, often by composing schedulers or using custom lambda functions. For instance, in PyTorch, you could implement linear warmup combined with another scheduler like `StepLR` or `ExponentialLR` using `LambdaLR`, or by chaining schedulers with `SequentialLR`.
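As a sketch of the chaining approach, PyTorch's built-in `LinearLR` scheduler can supply the warmup phase and hand off to `StepLR` via `SequentialLR`; the model and hyperparameter values below are placeholders:

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, StepLR, SequentialLR

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Warm up from 1% of the base lr to the full lr over 10 steps,
# then decay by 10x every 30 steps
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
decay = StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[10])

for step in range(50):
    optimizer.step()
    scheduler.step()
```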
Here's a conceptual example using `LambdaLR` for linear warmup:
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

# Assume model and num_training_steps are defined elsewhere
target_lr = 0.001
warmup_steps = 1000

def lr_lambda(current_step: int):
    if current_step < warmup_steps:
        # Linear warmup: the factor grows from 0 toward 1
        return float(current_step) / float(max(1, warmup_steps))
    # After warmup, keep a constant factor (or implement decay here), e.g.:
    # decay_start_step = 10000
    # decay_factor = 0.1
    # if current_step >= decay_start_step:
    #     return decay_factor
    return 1.0

# The optimizer's lr is the base value that the lambda's factor multiplies,
# so it must be set to the target learning rate
optimizer = optim.Adam(model.parameters(), lr=target_lr)
scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop
for step in range(num_training_steps):
    # ... forward pass, compute loss, loss.backward() ...
    optimizer.step()
    scheduler.step()  # update learning rate
    optimizer.zero_grad()
```
This snippet defines a lambda function that increases the learning rate multiplier linearly from near zero up to 1.0 over `warmup_steps`. The optimizer is initialized with `target_lr`, and the scheduler scales this value according to the lambda function's output at each step. After the warmup period, this simple example keeps the learning rate constant at `target_lr`, but you could easily modify the `lr_lambda` function to implement a decay schedule afterward. Libraries like `transformers` from Hugging Face often provide pre-built schedulers that include warmup phases.
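For reference, the common "linear warmup then linear decay to zero" shape that such pre-built schedulers produce can be sketched in plain Python (this is an illustrative reimplementation of the shape, not any library's actual code):

```python
def warmup_then_linear_decay(step, target_lr, warmup_steps, total_steps):
    """Linear warmup to target_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    # Fraction of the decay phase remaining
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return target_lr * max(0.0, remaining)

print(warmup_then_linear_decay(500, 1e-3, 1000, 10000))    # 0.0005
print(warmup_then_linear_decay(1000, 1e-3, 1000, 10000))   # 0.001
print(warmup_then_linear_decay(10000, 1e-3, 1000, 10000))  # 0.0
```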
In summary, learning rate warmup is a valuable technique for stabilizing the initial phase of deep learning training, preventing potential divergence or oscillations caused by large initial learning rates acting on randomly initialized weights or unstable adaptive optimizer states. It's often used in conjunction with other learning rate decay schedules and is particularly beneficial for complex models and large batch training.
© 2025 ApX Machine Learning