Setting the learning rate is one of the most significant hyperparameter choices you'll make when training deep neural networks. As discussed in the context of advanced optimizers, it is often difficult to find a single fixed learning rate η that works well throughout the entire training process. A learning rate that is too large can cause the optimization process to oscillate or even diverge, preventing convergence. Conversely, a learning rate that is too small can make training excessively slow or leave the optimizer permanently stuck in suboptimal local minima or saddle points.
Learning rate scheduling provides a strategy to adjust the learning rate dynamically during training. The general idea is to start with a relatively higher learning rate to make rapid progress initially and then gradually decrease it as training progresses. This allows the optimizer to settle into deeper, more optimal minima in the loss landscape during the later stages of training.
Several heuristics and predefined functions are commonly used to schedule the learning rate. Let's examine a few popular ones:
This is perhaps the simplest scheduling strategy. The learning rate is kept constant for a fixed number of epochs (or iterations) and then reduced by a specific factor.
For example, you might start with η=0.01, train for 30 epochs, then reduce it to η=0.001 for another 30 epochs, and finally reduce it to η=0.0001 for the remainder of the training.
Mathematically, if η0 is the initial learning rate, γ is the decay factor (e.g., 0.1), and S is the step size in epochs, the learning rate ηe at epoch e can be defined as:
$$\eta_e = \eta_0 \cdot \gamma^{\lfloor e / S \rfloor}$$

While simple to implement, step decay requires careful manual tuning of the initial rate, the decay factor, and the step size, and good values can be sensitive to the dataset and model architecture.
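As a minimal sketch, a step-decay schedule using the example values above (η0 = 0.01, γ = 0.1, S = 30 epochs) can be computed like this:

```python
import math

def step_decay(epoch, eta0=0.01, gamma=0.1, step_size=30):
    """Step decay: eta0 * gamma ** floor(epoch / step_size)."""
    return eta0 * gamma ** math.floor(epoch / step_size)

# The rate is flat within each 30-epoch block, then drops by a factor of 10:
# epochs 0-29 use 0.01, epochs 30-59 use 0.001, epoch 60 onward uses 0.0001.
lrs = [step_decay(e) for e in (0, 29, 30, 59, 60)]
```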
Instead of discrete steps, exponential decay reduces the learning rate continuously after each epoch (or even each iteration). The learning rate ηe at epoch e is given by:
$$\eta_e = \eta_0 \cdot e^{-k \cdot e}$$

Here, η0 is the initial learning rate, and k is a decay-rate hyperparameter. This provides a smoother decrease compared to step decay but still requires tuning η0 and k. A related approach applies a multiplicative factor γ < 1 every epoch: $\eta_e = \eta_0 \cdot \gamma^{e}$.
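Both variants are one-liners; in this sketch the values η0 = 0.01, k = 0.05, and γ = 0.95 are illustrative choices, not recommendations:

```python
import math

def exp_decay(epoch, eta0=0.01, k=0.05):
    """Continuous exponential decay: eta0 * exp(-k * epoch)."""
    return eta0 * math.exp(-k * epoch)

def multiplicative_decay(epoch, eta0=0.01, gamma=0.95):
    """Per-epoch multiplicative decay: eta0 * gamma ** epoch."""
    return eta0 * gamma ** epoch
```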
Cosine annealing offers a more sophisticated decay pattern. The learning rate decreases from an initial value ηmax to a minimum value ηmin (often 0) following a cosine curve over a specified number of epochs, Tmax. The learning rate ηt at epoch t (where t ranges from 0 to Tmax) is calculated as:
$$\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{t\pi}{T_{\max}}\right)\right)$$

This schedule starts by decreasing the learning rate slowly, accelerates the decrease in the middle, and slows down again towards the end. It's often used with "warm restarts," where the schedule is reset periodically (e.g., setting Tmax to a fraction of the total epochs and repeating the cycle). This allows the optimizer to potentially escape poor local minima by temporarily increasing the learning rate again.
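The formula translates directly to code; here ηmax = 0.01, ηmin = 0.0001, and Tmax = 100 are illustrative values:

```python
import math

def cosine_annealing(t, eta_max=0.01, eta_min=0.0001, t_max=100):
    """Cosine annealing from eta_max (at t=0) down to eta_min (at t=t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))
```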
An alternative approach, introduced by Leslie N. Smith, involves cyclically varying the learning rate between a lower bound (base_lr) and an upper bound (max_lr). Instead of monotonically decreasing the learning rate, CLR suggests that periodically increasing it can have benefits, such as helping the optimizer traverse saddle points more quickly and explore the loss landscape more effectively.
During training, the learning rate oscillates between base_lr and max_lr. Several functional forms can define this oscillation, with the "triangular" policy being common. For iteration $i$:

$$\text{cycle} = \left\lfloor 1 + \frac{i}{2 \cdot \text{stepsize}} \right\rfloor, \qquad x = \left| \frac{i}{\text{stepsize}} - 2 \cdot \text{cycle} + 1 \right|$$

$$\eta_i = \text{base\_lr} + (\text{max\_lr} - \text{base\_lr}) \cdot \max(0,\, 1 - x)$$

Here, stepsize is the number of iterations in half a cycle and is a hyperparameter. One full cycle takes 2×stepsize iterations. Other cycle shapes exist, such as triangular2 (where the maximum learning rate is halved after each cycle) or using an exponential decay within the cycle.
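A sketch of the triangular policy, using the illustrative bounds base_lr = 0.001, max_lr = 0.01, and stepsize = 25 iterations:

```python
import math

def triangular_clr(iteration, base_lr=0.001, max_lr=0.01, stepsize=25):
    """Triangular cyclical learning rate: rises linearly from base_lr to
    max_lr over `stepsize` iterations, then falls back over the next
    `stepsize` iterations, repeating indefinitely."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```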
A significant advantage of CLR is that it provides a systematic way to find reasonable values for base_lr and max_lr using an "LR Range Test." This involves running the model for a few epochs while linearly increasing the learning rate from a very small value to a large one and recording the loss at each step.
You typically plot the loss against the learning rate (on a log scale). The optimal base_lr is often the value where the loss starts to decrease, and the optimal max_lr is the value just before the loss starts to explode or oscillate wildly.
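A minimal sketch of generating the learning-rate sweep for such a range test; the `train_step` referenced in the comment is a hypothetical placeholder for training on one mini-batch and recording the loss:

```python
def lr_range_test(num_iters=100, start_lr=1e-7, end_lr=1.0):
    """Produce the linearly increasing learning rates for an LR range test.
    The bounds and iteration count here are illustrative."""
    step = (end_lr - start_lr) / (num_iters - 1)
    lrs = [start_lr + i * step for i in range(num_iters)]
    # losses = [train_step(lr) for lr in lrs]  # hypothetical training step
    return lrs
```

In practice you would plot the recorded losses against these rates on a log scale and read off base_lr and max_lr as described above.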
A popular variant is the "1cycle" policy, also proposed by Smith. It uses a single cycle spanning the entire training duration (or slightly less): the learning rate ramps up from a low initial value to max_lr and then anneals back down, often finishing well below the starting value.
This policy is often combined with a cyclical momentum schedule, where momentum decreases as the learning rate increases, and vice versa. The 1cycle policy has been shown to achieve good results quickly, sometimes achieving super-convergence (reaching high accuracy with fewer epochs than traditional schedules).
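One way to sketch a 1cycle-style schedule is with cosine interpolation in both phases; the parameter names and defaults below (start_div, final_div, pct_up) are illustrative assumptions, not taken from any specific library:

```python
import math

def one_cycle(t, total_steps, max_lr=0.01, start_div=25.0, final_div=1e4, pct_up=0.3):
    """1cycle sketch: cosine ramp from max_lr/start_div up to max_lr over the
    first pct_up fraction of training, then cosine decay down to
    max_lr/final_div by the final step."""
    up_steps = int(pct_up * total_steps)
    if t < up_steps:
        lo, hi = max_lr / start_div, max_lr   # warm-up phase
        frac = t / up_steps
    else:
        lo, hi = max_lr, max_lr / final_div   # annealing phase
        frac = (t - up_steps) / (total_steps - up_steps)
    # Cosine interpolation from lo (frac=0) to hi (frac=1).
    return hi + (lo - hi) * 0.5 * (1 + math.cos(math.pi * frac))
```

A momentum schedule mirroring this curve (high when the learning rate is low, and vice versa) could be built the same way with the bounds swapped.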
The following chart illustrates the behavior of Step Decay, Cosine Annealing, and a Triangular CLR policy over 100 epochs.
[Chart: comparison of learning rate schedules over 100 epochs — Step Decay (η0 = 0.01, factor 0.1, step 30), Cosine Annealing (max 0.01, min 0.0001), and Triangular CLR (base 0.001, max 0.01, stepsize 25), with the learning rate on a log scale.]
Different learning rate schedules exhibit distinct patterns over training epochs. Step decay shows abrupt changes, cosine annealing provides a smooth curve, and cyclical rates oscillate between bounds. (Note: Y-axis is log scale).
In most deep learning frameworks, these schedules are implemented as scheduler objects: you call the scheduler's step() method at the end of each epoch (or sometimes each iteration, depending on the scheduler).

Effectively scheduling the learning rate is a powerful technique for improving convergence speed and the final performance of your deep learning models. Experimenting with different schedules and their hyperparameters, guided by techniques like the LR Range Test for CLR, is an important part of the optimization process.
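As a closing sketch, the per-epoch step() pattern described above might look like this, using a hand-rolled scheduler class (illustrative only, not a real library API):

```python
class StepLRScheduler:
    """Minimal step-decay scheduler exposing a step()-per-epoch interface."""

    def __init__(self, lr=0.01, gamma=0.1, step_size=30):
        self.lr, self.gamma, self.step_size = lr, gamma, step_size
        self.epoch = 0

    def step(self):
        """Advance one epoch; decay the rate every `step_size` epochs."""
        self.epoch += 1
        if self.epoch % self.step_size == 0:
            self.lr *= self.gamma

sched = StepLRScheduler()
for _ in range(60):
    # ... run one training epoch using sched.lr ...
    sched.step()  # update the learning rate at the end of the epoch
```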
© 2025 ApX Machine Learning