While adaptive methods like Adam, RMSprop, and AdaGrad adjust learning rates based on the history of gradients, often providing significant improvements over fixed learning rates, applying a global learning rate schedule can yield further benefits in stability and performance. A learning rate schedule systematically adjusts the learning rate η over the course of training, typically decreasing it as training progresses.
The core idea is intuitive: start with a relatively large learning rate to make substantial progress early on when parameters are likely far from optimal values. As training converges and parameters approach a good solution, decrease the learning rate to allow for finer adjustments, reducing the risk of overshooting the minimum and helping to navigate complex loss surfaces more effectively. Even when using adaptive optimizers, which manage per-parameter rates, applying a schedule to the global base learning rate η is a common and often effective practice.
Let's examine several popular scheduling strategies.
Several functions are commonly used to define how the learning rate changes, usually as a function of the epoch number or iteration count t.
Step Decay: This is one of the simplest schedules. The learning rate is kept constant for a fixed number of epochs (a "step") and then decreased by a certain factor. For example, you might halve the learning rate every 10 epochs.
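As a minimal illustration, step decay can be written as a pure function of the epoch number; the parameter names here (initial_lr, drop_factor, epochs_per_step) are just illustrative choices for this sketch.

def step_decay_lr(epoch, initial_lr=0.1, drop_factor=0.5, epochs_per_step=10):
    # Multiply the initial rate by drop_factor once every epochs_per_step epochs.
    return initial_lr * (drop_factor ** (epoch // epochs_per_step))

# Epochs 0-9 -> 0.1, epochs 10-19 -> 0.05, epochs 20-29 -> 0.025, ...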
Exponential Decay: Provides a smoother decrease than step decay. The learning rate is multiplied by a decay factor less than 1 after each epoch (or sometimes, each iteration).
A common form is

$$\eta_t = \eta_0 \cdot \text{decay\_factor}^{\,t}$$

where η_0 is the initial learning rate and decay_factor, a constant slightly less than 1, controls the rate of decrease.

Inverse Time Decay (1/t Decay): Decreases the learning rate proportionally to the inverse of the iteration or epoch number.
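As a rough sketch, both of these rules can be expressed as plain functions of the step count t; the parameter names (initial_lr, decay_factor, k) are illustrative.

def exponential_decay_lr(t, initial_lr=0.1, decay_factor=0.97):
    # Multiply the rate by decay_factor once per epoch (or iteration) t.
    return initial_lr * (decay_factor ** t)

def inverse_time_decay_lr(t, initial_lr=0.1, k=0.1):
    # Shrink the rate proportionally to 1 / (1 + k * t).
    return initial_lr / (1.0 + k * t)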
Cosine Annealing: This popular schedule decreases the learning rate following the shape of a cosine curve over a defined period T. It starts at an initial rate η_max and smoothly anneals down to a minimum rate η_min (often 0).
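A small sketch of a single annealing period of length T, interpolating between eta_max and eta_min with a half cosine (the names are illustrative); the modulo makes the schedule restart every T epochs, which is the warm-restart variant plotted in the figure below.

import math

def cosine_annealing_lr(t, T=50, eta_max=0.1, eta_min=0.0):
    # Anneal from eta_max down to eta_min over T epochs, then restart.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * (t % T) / T))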
{"layout": {"title": "Comparison of Learning Rate Schedules", "xaxis": {"title": "Epoch"}, "yaxis": {"title": "Learning Rate", "range": [0, 0.105]}, "height": 350, "width": 600, "margin": {"l": 50, "r": 20, "t": 40, "b": 40}}, "data": [{"name": "Step Decay (factor=0.5, step=20)", "x": [0, 19, 20, 39, 40, 59, 60, 79, 80, 100], "y": [0.1, 0.1, 0.05, 0.05, 0.025, 0.025, 0.0125, 0.0125, 0.00625, 0.00625], "mode": "lines", "line": {"color": "#1c7ed6"}}, {"name": "Exponential Decay (k=0.03)", "x": [i for i in range(101)], "y": [0.1 * (0.97**i) for i in range(101)], "mode": "lines", "line": {"color": "#74b816"}}, {"name": "Cosine Annealing (T=50)", "x": [i for i in range(101)], "y": [0.0 + 0.5 * (0.1 - 0.0) * (1 + np.cos( (i % 50) * np.pi / 50)) for i in range(101)], "mode": "lines", "line": {"color": "#f76707"}}]}
Different learning rate schedules over 100 epochs with an initial learning rate of 0.1. Step decay reduces the rate abruptly, exponential decay decreases smoothly, and cosine annealing follows a periodic pattern (shown here with restarts every 50 epochs).
Cyclical Learning Rates (CLR): Instead of monotonically decreasing the rate, CLR varies it cyclically between a minimum (η_min) and maximum (η_max) bound. Common forms include triangular or cosine-based cycles.
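A sketch of the triangular form, in which the rate ramps linearly from eta_min up to eta_max and back down once per cycle (cycle_length and the bounds are illustrative):

def triangular_clr(t, cycle_length=20, eta_min=0.001, eta_max=0.1):
    # Position within the current cycle, mapped to a triangular wave in [0, 1].
    position = (t % cycle_length) / cycle_length
    distance = 1.0 - abs(2.0 * position - 1.0)
    return eta_min + (eta_max - eta_min) * distance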
Warmup: Particularly relevant for training very deep networks (like Transformers) or when using large batch sizes, a warmup phase is often employed at the beginning of training. During warmup, the learning rate starts very low (at or near 0) and gradually increases, for example linearly or quadratically, to the target initial learning rate (η_0 or η_max) over a specified number of initial epochs or iterations.
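A sketch of linear warmup layered in front of an arbitrary base schedule: for the first warmup_epochs the rate ramps linearly toward target_lr, after which the base schedule takes over (the function and parameter names are illustrative).

def warmup_then(base_schedule, t, warmup_epochs=10, target_lr=0.1):
    # Linear warmup toward target_lr, then hand off to the base schedule.
    if t < warmup_epochs:
        return target_lr * (t + 1) / warmup_epochs
    return base_schedule(t - warmup_epochs)

# Example: 10 warmup epochs, then cosine annealing over the remaining 90 epochs:
# lr_at_t = warmup_then(lambda s: cosine_annealing_lr(s, T=90), t)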
Learning rate schedules are frequently used in conjunction with adaptive optimizers like Adam or RMSprop. The schedule typically controls the global base learning rate η_t at each step t. The adaptive optimizer then uses this η_t along with its internal state (e.g., estimates of the first and second moments of the gradients) to compute the final parameter updates.
For Adam, the update rule is:

$$\theta_{t+1} = \theta_t - \eta_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here, η_t is the learning rate determined by the active schedule at time step t, m̂_t and v̂_t are Adam's bias-corrected estimates of the first and second moments of the gradient, and ε is a small constant for numerical stability. The schedule provides a global, time-dependent modulation, while Adam provides local, parameter-specific adaptation based on gradient statistics. This combination often proves highly effective in practice.
# Conceptual Example (Pseudocode-like)
initial_lr = 0.001
optimizer = Adam(parameters, lr=initial_lr) # Adam uses initial_lr
# Example: Cosine Annealing with Warmup
num_epochs = 100
warmup_epochs = 10
scheduler = CosineAnnealingWithWarmup(optimizer,
                                      warmup_epochs=warmup_epochs,
                                      total_epochs=num_epochs,
                                      min_lr=1e-6)
for epoch in range(num_epochs):
    # Training loop ...
    # train_one_epoch(model, dataloader, optimizer)

    # Update learning rate at the end of each epoch
    scheduler.step()

    # Optional: Log current learning rate
    # current_lr = scheduler.get_last_lr()[0]
    # print(f"Epoch {epoch+1}, LR: {current_lr}")

    # Validation loop ...
    # validate(model, val_dataloader)
Structure showing how a scheduler might be integrated into a training loop, updating the optimizer's learning rate based on the epoch number.
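If you happen to be working in PyTorch, a similar warmup-plus-cosine schedule can be assembled from the library's built-in schedulers instead of the hypothetical CosineAnnealingWithWarmup class above. The sketch below assumes a recent PyTorch version, uses a placeholder model, and omits the actual training and validation steps.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(16, 1)                 # placeholder model for the sketch
optimizer = Adam(model.parameters(), lr=1e-3)

num_epochs, warmup_epochs = 100, 10
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=num_epochs - warmup_epochs, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs])

for epoch in range(num_epochs):
    # train_one_epoch(model, dataloader, optimizer)  # real loop: optimizer.step() runs here
    # validate(model, val_dataloader)                # validation omitted in this sketch
    scheduler.step()                                 # advance the schedule once per epoch
    current_lr = scheduler.get_last_lr()[0]          # current scheduled rate, e.g. for logging
    # print(f"Epoch {epoch + 1}, LR: {current_lr:.6f}")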
Understanding and appropriately applying learning rate schedules adds another powerful tool to your optimization toolkit. While adaptive methods handle much of the per-parameter rate adjustment, schedules provide essential global control over the learning process, facilitating faster convergence, better stability, and potentially improved final model performance. Experimentation and careful monitoring remain significant components of finding the best scheduling strategy for your specific machine learning task.