Now that we understand why adjusting the learning rate during training is beneficial, let's look at how to implement these learning rate scheduling techniques in practice using a common deep learning framework like PyTorch.
Most deep learning libraries provide convenient ways to attach learning rate schedulers directly to your optimizers. The general workflow involves creating a scheduler instance tied to your optimizer and then calling its step() method at the appropriate point within your training loop (usually after each epoch, but sometimes after each batch). Let's examine how to implement some of the common scheduling methods discussed earlier.
PyTorch offers a variety of built-in learning rate schedulers in its torch.optim.lr_scheduler
module. We'll demonstrate a few popular ones.
First, assume you have defined your model and an optimizer, for instance:
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR
# Assume 'model' is your defined neural network
# Initial learning rate
initial_lr = 0.01
# Initialize the optimizer (e.g., Adam)
optimizer = optim.Adam(model.parameters(), lr=initial_lr)
# We will define the scheduler next...
Now, let's attach different schedulers.
Step Decay (StepLR)
Step decay reduces the learning rate by a factor (gamma) every step_size epochs.
# Define StepLR scheduler
# Decay LR by gamma=0.1 every step_size=10 epochs
scheduler_step = StepLR(optimizer, step_size=10, gamma=0.1)
# Example usage within a training loop (simplified):
# num_epochs = 50
# for epoch in range(num_epochs):
#     # --- Training phase ---
#     # model.train()
#     # for batch in dataloader:
#     #     optimizer.zero_grad()
#     #     outputs = model(batch.features)
#     #     loss = compute_loss(outputs, batch.labels)
#     #     loss.backward()
#     #     optimizer.step()
#     # --- End Training phase ---
#
#     # Update the learning rate after each epoch
#     scheduler_step.step()
#
#     current_lr = optimizer.param_groups[0]['lr']
#     print(f"Epoch {epoch+1}: Current LR = {current_lr:.6f}")
This scheduler will keep the learning rate at 0.01 for epochs 1-10, then drop it to 0.001 for epochs 11-20, then 0.0001 for epochs 21-30, and so on.
StepLR learning rate decay over 50 epochs with an initial rate of 0.01, step size of 10, and gamma of 0.1. The y-axis is logarithmic.
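If you want to preview the values a schedule will produce before running any training, one option is to attach the scheduler to a throwaway optimizer and step it in a loop. A minimal sketch, reusing the imports above (the dummy parameter and optimizer here exist purely for illustration):
# Preview the StepLR schedule without training
dummy_param = torch.zeros(1, requires_grad=True)
dummy_opt = optim.SGD([dummy_param], lr=0.01)
dummy_sched = StepLR(dummy_opt, step_size=10, gamma=0.1)

for epoch in range(30):
    # get_last_lr() returns the learning rate(s) currently set by the scheduler
    print(f"Epoch {epoch+1}: LR = {dummy_sched.get_last_lr()[0]:.6f}")
    dummy_sched.step()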
Exponential Decay (ExponentialLR)
Exponential decay multiplies the learning rate by a factor (gamma) after every epoch.
# Define ExponentialLR scheduler
# Decay LR by gamma=0.95 every epoch
scheduler_exp = ExponentialLR(optimizer, gamma=0.95)
# Example usage within a training loop (simplified):
# num_epochs = 50
# for epoch in range(num_epochs):
#     # --- Training phase (as above) ---
#
#     # Update the learning rate after each epoch
#     scheduler_exp.step()
#
#     current_lr = optimizer.param_groups[0]['lr']
#     print(f"Epoch {epoch+1}: Current LR = {current_lr:.6f}")
Here, the learning rate decreases more smoothly compared to step decay: 0.01, 0.01×0.95, 0.01×0.95², and so on.
ExponentialLR learning rate decay over 50 epochs with an initial rate of 0.01 and gamma of 0.95.
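Because the decay compounds every epoch, the learning rate at epoch t follows the closed form lr_t = initial_lr × gamma^t, which is handy for choosing gamma to hit a target final learning rate. A small sketch (the 0.01 → 0.001 target over 50 epochs is just an example):
# Closed form for exponential decay: lr_t = initial_lr * gamma ** t
initial_lr, gamma = 0.01, 0.95
for t in [0, 1, 10, 50]:
    print(f"Epoch {t}: LR = {initial_lr * gamma ** t:.6f}")

# Solving for gamma: to decay from 0.01 to 0.001 over 50 epochs,
# gamma = (0.001 / 0.01) ** (1 / 50), roughly 0.955
print((0.001 / 0.01) ** (1 / 50))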
Cosine Annealing (CosineAnnealingLR)
Cosine annealing gradually decreases the learning rate following a cosine curve, reaching a minimum value (eta_min) over a specified number of epochs (T_max). A variant that periodically restarts the annealing cycle is available as CosineAnnealingWarmRestarts.
# Define CosineAnnealingLR scheduler
# Anneal over T_max=50 epochs, minimum LR eta_min=0
scheduler_cos = CosineAnnealingLR(optimizer, T_max=50, eta_min=0)
# Example usage within a training loop (simplified):
# num_epochs = 50
# for epoch in range(num_epochs):
#     # --- Training phase (as above) ---
#
#     # Update the learning rate after each epoch
#     scheduler_cos.step()
#
#     current_lr = optimizer.param_groups[0]['lr']
#     print(f"Epoch {epoch+1}: Current LR = {current_lr:.6f}")
This scheduler smoothly decreases the learning rate from the initial value down to eta_min
over the T_max
epochs.
CosineAnnealingLR learning rate decay over 50 epochs with an initial rate of 0.01, T_max of 50, and eta_min of 0.
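The shape of the curve comes from the cosine formula the scheduler follows in the simple, non-restarting case: eta_t = eta_min + (eta_max − eta_min) × (1 + cos(π·t/T_max)) / 2. A short sketch evaluating it for the settings above:
import math

# Cosine annealing (no restarts):
# eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * t / T_max)) / 2
eta_max, eta_min, T_max = 0.01, 0.0, 50
for t in [0, 12, 25, 38, 50]:
    eta_t = eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max)) / 2
    print(f"Epoch {t}: LR = {eta_t:.6f}")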
Learning rate warmup, where the learning rate starts very low and increases linearly or non-linearly over a few initial epochs or steps, isn't always a built-in scheduler class by itself. It's often combined with another decay schedule.
You might implement warmup manually or use utility functions/classes provided by libraries like Hugging Face's transformers
. A simple conceptual approach for linear warmup over warmup_epochs
:
# Conceptual manual warmup (within the training loop)
initial_lr = 0.01
warmup_epochs = 5
num_epochs = 50
scheduler_main = None  # main decay scheduler, created once warmup finishes

for epoch in range(num_epochs):
    if epoch < warmup_epochs:
        # Linearly increase LR from initial_lr / warmup_epochs up to initial_lr
        lr = initial_lr * (epoch + 1) / warmup_epochs
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
    elif epoch == warmup_epochs:
        # Warmup finished: LR is now exactly initial_lr, so hand over to the
        # main scheduler (e.g., cosine annealing over the remaining epochs)
        scheduler_main = CosineAnnealingLR(
            optimizer, T_max=num_epochs - warmup_epochs, eta_min=0
        )
    else:
        # Apply the main scheduler after warmup is complete
        scheduler_main.step()

    # --- Training phase ---

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1}: Current LR = {current_lr:.6f}")
Note: PyTorch 1.10+ introduced SequentialLR and ChainedScheduler, which provide more elegant ways to combine schedulers, such as warmup followed by decay.
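For example, a linear warmup followed by cosine annealing can be expressed by chaining two built-in schedulers with SequentialLR. A sketch assuming PyTorch 1.10 or newer, with the same 5-epoch warmup as above:
from torch.optim.lr_scheduler import LinearLR, SequentialLR

warmup_epochs = 5
num_epochs = 50

# Ramp from 10% of the base LR up to the full LR over the first 5 epochs
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
# Then anneal down to zero over the remaining epochs
cosine = CosineAnnealingLR(optimizer, T_max=num_epochs - warmup_epochs, eta_min=0)

# SequentialLR hands control from 'warmup' to 'cosine' at the milestone epoch
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs])

# In the training loop, only scheduler.step() needs to be called once per epoch.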
The critical part is calling scheduler.step()
at the correct time. Most schedulers, like StepLR
, ExponentialLR
, and CosineAnnealingLR
, are designed to be called once per epoch, typically after the validation loop for that epoch.
# Standard Training Loop Structure with Epoch-based Scheduler

# Initialize optimizer and scheduler
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # Example

num_epochs = 50

for epoch in range(num_epochs):
    # --- Training Phase ---
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()  # Update weights based on current gradients

    # --- Validation Phase (Optional) ---
    # model.eval()
    # ... validation logic ...

    # --- Learning Rate Update ---
    # Update the learning rate *after* the training and validation steps for the epoch
    scheduler.step()

    # Log learning rate (optional)
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1} finished. New LR for next epoch: {current_lr:.6f}")
Important: Always call scheduler.step() after optimizer.step(). If the order is reversed, PyTorch (1.1.0 and later) skips the first value of the learning rate schedule and emits a warning. For schedulers designed to step per batch (less common, but used by schedulers such as OneCycleLR and CyclicLR), the scheduler.step() call goes inside the batch loop. Check the documentation for the specific scheduler you are using.
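To illustrate the per-batch pattern, here is a sketch using OneCycleLR, which is designed to be stepped after every optimizer update; it assumes the same model, train_loader, and criterion as the loop above:
from torch.optim.lr_scheduler import OneCycleLR

num_epochs = 50
optimizer = optim.Adam(model.parameters(), lr=0.01)
# OneCycleLR spreads one LR cycle across every batch of every epoch
scheduler = OneCycleLR(optimizer, max_lr=0.01,
                       epochs=num_epochs, steps_per_epoch=len(train_loader))

for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # per-batch schedulers step inside the batch loop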
By implementing learning rate scheduling, you gain finer control over the optimization process, often leading to faster convergence and better final model performance compared to using a fixed learning rate. Experimenting with different schedules and their parameters is a standard part of tuning deep learning models.