While step decay schedules offer a simple way to reduce the learning rate, their abrupt drops can sometimes disrupt training dynamics. For a smoother, more continuous adjustment, several alternative scheduling methods exist. Two widely used approaches are Exponential Decay and Cosine Annealing.
Exponential decay provides a gradual reduction of the learning rate over time. Instead of discrete steps, the learning rate is multiplied by a fixed decay factor less than 1 after a certain number of steps or epochs.
The formula for exponential decay can be expressed as:
\alpha_t = \alpha_0 \times \text{decay\_rate}^{(t / \text{decay\_steps})}

where \alpha_t is the learning rate at step t, \alpha_0 is the initial learning rate, decay_rate is the fixed factor less than 1, and decay_steps controls how frequently the decay compounds.
The result is a learning rate that drops quickly at first and then decreases ever more slowly, offering a smoother transition than step decay. This continuous reduction can help the optimizer settle more gently into areas of the loss landscape.
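As a concrete illustration of the formula, here is a small, framework-free Python sketch; the decay_rate and decay_steps values are arbitrary and chosen only for demonstration:

# Plain-Python sketch of the exponential decay formula (illustrative values).
initial_lr = 0.1      # alpha_0
decay_rate = 0.9      # multiplicative factor < 1
decay_steps = 10      # how many steps it takes for one full application of the factor

def exponential_decay(step):
    # alpha_t = alpha_0 * decay_rate ** (t / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)

for step in [0, 10, 20, 50]:
    print(f"step {step:3d}: lr = {exponential_decay(step):.5f}")
# step   0: lr = 0.10000
# step  10: lr = 0.09000
# step  20: lr = 0.08100
# step  50: lr = 0.05905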
Cosine Annealing is another popular scheduling technique that varies the learning rate cyclically, following the shape of a cosine curve. It starts at an initial maximum learning rate (αmax) and smoothly decreases towards a minimum learning rate (αmin, often 0) over a specified number of epochs (T).
The learning rate at epoch t (within a cycle of length T) is calculated as:
\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)

The key characteristic of cosine annealing is its slow decrease at the start of the cycle, a faster drop through the middle, and a gentle approach to αmin at the end. This schedule can be effective because it spends more time exploring the parameter space with higher learning rates early in the cycle and then refines the solution with lower rates towards the end.
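To make the formula concrete, here is a small Python sketch that evaluates it directly; the cycle length and learning-rate bounds are illustrative choices:

import math

# Plain-Python sketch of the cosine annealing formula (illustrative values).
alpha_max = 0.1   # initial (maximum) learning rate
alpha_min = 0.0   # final (minimum) learning rate
T = 25            # cycle length in epochs

def cosine_annealing(t):
    # alpha_t = alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + cos(t * pi / T))
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(t * math.pi / T))

for t in [0, 5, 12, 20, 25]:
    print(f"epoch {t:2d}: lr = {cosine_annealing(t):.5f}")
# epoch  0: lr = 0.10000
# epoch  5: lr = 0.09045
# epoch 12: lr = 0.05314
# epoch 20: lr = 0.00955
# epoch 25: lr = 0.00000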
A common variant is Cosine Annealing with Restarts (also known as Stochastic Gradient Descent with Warm Restarts or SGDR). In this approach, the cosine cycle is repeated multiple times during training. Each restart involves resetting the learning rate back to αmax and potentially increasing the cycle length T for subsequent cycles. These restarts can help the optimizer escape poor local minima and potentially find better, broader minima in the loss landscape.
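PyTorch provides this variant as CosineAnnealingWarmRestarts. Below is a minimal sketch; the toy model, optimizer settings, T_0, and T_mult values are placeholders for illustration:

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from torch.nn import Linear

model = Linear(10, 2)  # toy model for illustration
optimizer = optim.SGD(model.parameters(), lr=0.1)

# First cycle lasts T_0 = 10 epochs; each subsequent cycle is twice as long (T_mult = 2).
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0.001)

for epoch in range(70):
    # ... training steps for one epoch ...
    scheduler.step()  # the LR resets to its initial value at the start of each new cycle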
Beyond exponential decay and cosine annealing, other scheduling methods exist, though they are used less frequently. The choice of scheduler depends on the problem and the desired training dynamics.
Here's a visualization comparing these schedules:
Comparison of Step Decay (blue), Exponential Decay (orange), and one cycle of Cosine Annealing (green) over 50 epochs, starting from an initial learning rate of 0.1. Note the smooth decrease of Exponential and Cosine compared to the sharp drops of Step Decay. The Cosine curve returns to the max value if restarts are used (shown partially after epoch 25 for illustration).
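A comparison like the one described can be reproduced with a short matplotlib script; the step size, decay factor, and cycle length below are assumed values chosen to mirror the caption:

import math
import matplotlib.pyplot as plt

epochs = range(50)
lr0 = 0.1

# Step decay: drop the LR by 10x every 20 epochs (assumed step size and factor).
step = [lr0 * (0.1 ** (t // 20)) for t in epochs]
# Exponential decay: multiply by 0.95 each epoch (assumed decay rate).
expo = [lr0 * (0.95 ** t) for t in epochs]
# Cosine annealing: one 25-epoch cycle, restarting at epoch 25.
cos = [0.5 * lr0 * (1 + math.cos((t % 25) * math.pi / 25)) for t in epochs]

plt.plot(epochs, step, label="Step Decay")
plt.plot(epochs, expo, label="Exponential Decay")
plt.plot(epochs, cos, label="Cosine Annealing (with one restart)")
plt.xlabel("Epoch")
plt.ylabel("Learning rate")
plt.legend()
plt.show()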
Deep learning frameworks provide built-in support for various learning rate schedulers. Here's how you might implement Exponential Decay or Cosine Annealing in PyTorch:
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR, CosineAnnealingLR
from torch.nn import Linear  # Example model component

# Assume 'model' is your neural network and 'optimizer' is already defined.
# For example:
model = Linear(10, 2)  # A simple linear layer
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Option 1: Exponential Decay
# Decay LR by a factor of 0.9 every epoch
scheduler_exp = ExponentialLR(optimizer, gamma=0.9)

# Option 2: Cosine Annealing
# Anneal LR over 25 epochs per cycle
scheduler_cos = CosineAnnealingLR(optimizer, T_max=25, eta_min=0.001)  # eta_min is the minimum learning rate

# --- Inside the training loop ---
num_epochs = 50
for epoch in range(num_epochs):
    # --- Training steps for one epoch ---
    # model.train()
    # for data, target in train_loader:
    #     optimizer.zero_grad()
    #     output = model(data)
    #     loss = criterion(output, target)
    #     loss.backward()
    #     optimizer.step()
    # --- End of training steps ---

    # Update the learning rate using the chosen scheduler.
    # Choose one scheduler to activate:
    # scheduler_exp.step()
    scheduler_cos.step()

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1}, Current LR: {current_lr:.6f}")

    # --- Validation steps (optional) ---
    # model.eval()
    # ...
In this example, after defining the optimizer, we create a scheduler object (either ExponentialLR or CosineAnnealingLR). Inside the training loop, after each epoch completes (i.e., after processing all batches for that epoch), we call scheduler.step(). This updates the learning rate within the optimizer according to the chosen schedule for the next epoch. Remember to use only one scheduler at a time unless combining them intentionally, which requires more advanced configuration.
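If you do want to combine schedules, recent PyTorch releases offer wrappers such as SequentialLR. The sketch below chains a linear warm-up with cosine annealing; the warm-up length, milestone, and learning-rate values are arbitrary choices for illustration:

import torch.optim as optim
from torch.nn import Linear
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = Linear(10, 2)  # toy model for illustration
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Warm up linearly for the first 5 epochs, then cosine-anneal for the remaining 45.
# LinearLR and SequentialLR are available in recent PyTorch releases.
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=45, eta_min=0.001)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

# In the training loop, call scheduler.step() once per epoch, exactly as before.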
Experimenting with different learning rate schedules and their parameters is a significant part of the hyperparameter tuning process, allowing you to fine-tune the optimization path for better convergence and model performance.