While a fixed learning rate can work for simpler problems, training large, complex models often benefits significantly from adjusting the learning rate dynamically during the optimization process. A carefully chosen learning rate schedule can accelerate convergence, help navigate complex loss landscapes, and lead to better final model performance compared to using a single, static learning rate. PyTorch provides a flexible framework for implementing various scheduling strategies through the torch.optim.lr_scheduler module.
The core idea behind most learning rate schedules is intuitive: start with a relatively large learning rate to make substantial progress in the initial phases of training when parameters are far from optimal. Then, as training progresses and the model approaches a potential minimum, gradually decrease the learning rate to allow for finer adjustments and prevent overshooting the optimal point. This dynamic adjustment helps balance exploration and exploitation during the search for optimal model parameters.
PyTorch offers several built-in schedulers, allowing you to implement sophisticated learning rate adjustments with minimal code. Let's examine some of the most effective and commonly used strategies in advanced training regimes.
Cosine annealing is a popular scheduling technique that smoothly decreases the learning rate following a cosine curve. It starts at the initial learning rate specified in the optimizer and gradually lowers it to a minimum value (eta_min) over a defined number of epochs or steps (T_max). The learning rate η_t at epoch t is calculated as:

η_t = η_min + (1/2)(η_max − η_min)(1 + cos(tπ / T_max))

where η_max is the initial learning rate and η_min is the minimum learning rate (eta_min).
This smooth, gradual decay helps the optimizer settle into good minima towards the end of training without the abrupt changes associated with step-based decay methods.
Here's how you implement CosineAnnealingLR in PyTorch:
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
import matplotlib.pyplot as plt  # For visualization

# Example setup
model_params = [torch.randn(10, 5, requires_grad=True)]
optimizer = SGD(model_params, lr=0.1)

# Cosine Annealing: anneal LR from 0.1 down to 0 over 100 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0)

# Simulate training loop to visualize LR changes
lrs = []
for epoch in range(150):  # Simulate more epochs than T_max
    # optimizer.step()  # Normally called after loss.backward()
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()

# Simple plotting
plt.figure()
plt.plot(range(150), lrs)
plt.title("CosineAnnealingLR (T_max=100)")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.grid(True)
plt.show()
Notice how the learning rate decreases following the cosine curve and reaches eta_min (0) at T_max (100 epochs). Note that the schedule is periodic: if you keep calling scheduler.step() beyond T_max, the learning rate follows the cosine curve back up again, so T_max is usually set to match the total number of training epochs.
A variation is CosineAnnealingWarmRestarts. Instead of annealing just once, this scheduler restarts the cosine annealing cycle periodically. It anneals the learning rate over T_0 epochs, then "restarts" by resetting the learning rate back to its initial value and beginning a new annealing cycle. The length of subsequent cycles can optionally be increased by a factor T_mult. Within each cycle, the learning rate follows the same cosine formula:

η_t = η_min + (1/2)(η_max − η_min)(1 + cos(T_cur π / T_i))

Here, T_i is the length of the current cycle (initially T_0, multiplied by T_mult after each restart), and T_cur is the number of epochs elapsed since the last restart.
These "warm restarts" can help the optimizer escape suboptimal local minima that it might have settled into during an annealing phase.
import torch
from torch.optim import AdamW  # Often used with advanced schedules
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
import matplotlib.pyplot as plt  # For visualization

# Example setup
model_params = [torch.randn(10, 5, requires_grad=True)]
optimizer = AdamW(model_params, lr=0.01)  # Initial LR

# Cosine Annealing with Warm Restarts:
# First cycle lasts 50 epochs (T_0=50).
# Double the cycle length after each restart (T_mult=2).
# Minimum LR is 1e-5.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=50, T_mult=2, eta_min=1e-5)

# Simulate training loop
lrs_restarts = []
num_epochs = 350  # Three full cycles: T_0 + T_0*T_mult + T_0*T_mult**2 = 50 + 100 + 200
for epoch in range(num_epochs):
    # optimizer.step()  # Normally called after loss.backward()
    lrs_restarts.append(optimizer.param_groups[0]['lr'])
    scheduler.step()

# Simple plotting
plt.figure()
plt.plot(range(num_epochs), lrs_restarts)
plt.title("CosineAnnealingWarmRestarts (T_0=50, T_mult=2)")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.grid(True)
plt.show()
This scheduler creates cycles of annealing, with each cycle potentially lasting longer than the previous one.
Starting training directly with a large learning rate, especially when using adaptive optimizers like Adam or AdamW, can sometimes lead to instability or divergence early on. The initial parameter gradients might be large and noisy, causing significant, potentially detrimental updates.
A common technique to mitigate this is learning rate warmup. During the first few epochs (or batches) of training, the learning rate is gradually increased from a very small value (e.g., close to zero) up to the target initial learning rate. This allows the model to stabilize before larger updates are applied.
Warmup is typically not a standalone scheduler in PyTorch's lr_scheduler module but is often implemented by combining schedulers or using LambdaLR. A common approach is to implement a linear warmup followed by another decay schedule like cosine annealing.
Here's an example implementing linear warmup followed by cosine annealing using LambdaLR:
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
import math
import matplotlib.pyplot as plt  # For visualization

# Example setup
model_params = [torch.randn(10, 5, requires_grad=True)]
initial_lr = 0.01
optimizer = AdamW(model_params, lr=initial_lr)

# Parameters
warmup_epochs = 10
total_epochs = 100
cosine_epochs = total_epochs - warmup_epochs

# Single LambdaLR: linear warmup followed by cosine decay.
# The factor returned here is multiplied by initial_lr.
def lr_lambda_warmup(current_epoch):
    if current_epoch < warmup_epochs:
        # Linear warmup from a small value up to the full initial_lr
        return float(current_epoch + 1) / float(max(1, warmup_epochs))
    else:
        # Cosine decay, measured relative to the end of the warmup phase
        progress = float(current_epoch - warmup_epochs) / float(max(1, cosine_epochs))
        progress = min(progress, 1.0)  # Clamp so the LR stays at the minimum beyond total_epochs
        cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
        return cosine_decay

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda_warmup)

# Simulate training loop
lrs_warmup_cosine = []
for epoch in range(total_epochs + 20):  # Simulate a bit longer
    # optimizer.step()  # Normally called after loss.backward()
    lrs_warmup_cosine.append(optimizer.param_groups[0]['lr'])
    scheduler.step()

# Simple plotting
plt.figure()
plt.plot(range(total_epochs + 20), lrs_warmup_cosine)
plt.title("Linear Warmup (10 epochs) + Cosine Annealing")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.grid(True)
plt.show()
This approach uses a single LambdaLR scheduler whose lr_lambda function incorporates both the warmup logic and the subsequent cosine decay logic. Note that newer PyTorch versions also offer SequentialLR and ChainedScheduler for combining different schedulers more explicitly.
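As an illustration, here is a minimal sketch of the same warmup-plus-cosine idea built with SequentialLR, chaining a LinearLR warmup into a CosineAnnealingLR decay. The start factor, epoch counts, and eta_min below are arbitrary example values, not recommendations.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

# Example setup
model_params = [torch.randn(10, 5, requires_grad=True)]
optimizer = AdamW(model_params, lr=0.01)

warmup_epochs = 10
total_epochs = 100

# Phase 1: linear ramp from 1% of the base LR up to the full base LR
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_epochs)
# Phase 2: cosine decay over the remaining epochs
cosine = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs, eta_min=1e-5)

# SequentialLR switches from `warmup` to `cosine` at the milestone epoch
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # training and optimizer.step() would go here
    scheduler.step()

Compared with the LambdaLR version, this keeps the warmup and decay phases as separate, reusable scheduler objects at the cost of a little more setup.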
While cosine annealing and warmup are highly prevalent, other schedulers exist (brief constructor examples follow this list):
- StepLR: Decays the learning rate by a factor gamma every step_size epochs. Simple, but the sudden drops can sometimes disrupt training momentum.
- MultiStepLR: Similar to StepLR, but allows specifying the exact epochs (milestones) at which to decay the LR.
- ExponentialLR: Decays the learning rate by a factor gamma every epoch.
- PolynomialLR: Decays the learning rate following a polynomial function, offering flexibility between linear and other decay shapes.
- ReduceLROnPlateau: Reduces the learning rate when a monitored metric (e.g., validation loss) stops improving for a specified number of epochs (patience). This is adaptive based on performance rather than a fixed schedule.
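For reference, here is a short sketch of how several of these schedulers are constructed. The gamma, step_size, milestone, factor, and patience values are arbitrary examples; in practice you would attach only one scheduler to a given optimizer.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR, MultiStepLR, ExponentialLR, ReduceLROnPlateau

# Example setup
model_params = [torch.randn(10, 5, requires_grad=True)]
optimizer = SGD(model_params, lr=0.1)

# Multiply the LR by 0.1 every 30 epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Multiply the LR by 0.1 at epochs 30 and 80 specifically
multistep_scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

# Multiply the LR by 0.95 every epoch
exp_scheduler = ExponentialLR(optimizer, gamma=0.95)

# Halve the LR if the monitored metric has not improved for 5 epochs
plateau_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)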
Using a learning rate scheduler in PyTorch typically involves two steps:
1. Instantiate the scheduler, passing it the optimizer along with any scheduler-specific arguments.
2. Call the scheduler's step() method at the appropriate point in your training loop.
The placement of scheduler.step() is important:
- For epoch-based schedulers such as StepLR, CosineAnnealingLR, CosineAnnealingWarmRestarts (when defined by epochs), MultiStepLR, etc., you typically call scheduler.step() once per epoch, usually after the validation loop or at the end of the training epoch.
- Some schedulers are designed to be stepped per batch; for these, call scheduler.step() after each batch (i.e., after optimizer.step()). A per-batch sketch follows the training loop snippet below.
Always consult the specific scheduler's documentation. For most common epoch-based schedulers mentioned here, stepping once per epoch is standard. Note that ReduceLROnPlateau requires the metric value passed to its step() method (e.g., scheduler.step(validation_loss)).
# Typical Training Loop Snippet (Epoch-based stepping)
optimizer = AdamW(model.parameters(), lr=initial_lr)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)  # Or initialize your chosen scheduler here

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        inputs, labels = batch
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Batch-based scheduler step would go here if needed

    # --- Epoch End ---
    # Perform validation, logging, etc.

    # Step the scheduler (for epoch-based schedulers)
    scheduler.step()  # For ReduceLROnPlateau: scheduler.step(validation_loss)

    print(f"Epoch {epoch+1}/{num_epochs}, LR: {optimizer.param_groups[0]['lr']:.6f}")
Visualizing the learning rate over time is helpful for understanding and debugging your chosen schedule.
Comparison of different learning rate schedules over epochs (log scale y-axis). Cosine Annealing shows a smooth decay, Warm Restarts introduces periodic resets, and Warmup+Cosine includes an initial ramp-up phase.
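One practical way to do this yourself, sketched below with a dummy optimizer and CosineAnnealingLR as assumed examples, is to "dry-run" the schedule before training and record the learning rate at each step with get_last_lr().

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
import matplotlib.pyplot as plt

# Dry-run the schedule on a throwaway optimizer to preview it before real training
dummy_params = [torch.zeros(1, requires_grad=True)]
dummy_optimizer = SGD(dummy_params, lr=0.1)
scheduler = CosineAnnealingLR(dummy_optimizer, T_max=100, eta_min=1e-4)

planned_lrs = []
for _ in range(100):
    planned_lrs.append(scheduler.get_last_lr()[0])  # LR for the upcoming epoch
    dummy_optimizer.step()  # Avoids the "step() before optimizer.step()" warning
    scheduler.step()

plt.plot(planned_lrs)
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Planned schedule (dry run)")
plt.show()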
The optimal learning rate schedule and its parameters (initial LR, T_max, T_0, T_mult, eta_min, warmup duration) are highly dependent on the specific problem, dataset, model architecture, and chosen optimizer. There's no single best schedule for all situations. ReduceLROnPlateau can be effective when progress is directly tied to a measurable validation metric, but it can be slow to react if the metric plateaus frequently for reasons other than the learning rate.
Experimentation is key. Visualizing the planned schedule, monitoring training/validation loss curves, and leveraging hyperparameter optimization tools (discussed later in this chapter) are essential for finding the most effective scheduling strategy for your specific task. Remember that the interaction between the optimizer and the learning rate schedule is significant, so they should often be tuned together.