In the previous chapter, we looked at optimizers like Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG). These methods are workhorses for training deep learning models. A common element among them is the use of a single, global learning rate, often denoted as α. This learning rate determines the step size for updating all the model's parameters based on the computed gradients. While techniques like learning rate scheduling (which we'll cover later) can adjust this global rate over time, it remains the same for every parameter within a single update step.
This uniform approach, however, can be inefficient or even problematic when training complex deep neural networks. Why? Because the "ideal" step size might not be the same for all parameters. Let's consider a few scenarios where a one-size-fits-all learning rate falls short.
Consider a simple visualization of an elongated loss surface: a contour plot of a loss function where progress is much easier along the shallow horizontal axis (w1) than along the steep vertical axis (w2).
In such a scenario, the gradient will be much steeper along the w2 direction than along the w1 direction. If we use a single global learning rate α, we face a trade-off:

- If α is large enough to make meaningful progress along the shallow w1 direction, the updates along the steep w2 direction overshoot, causing the optimizer to oscillate back and forth across the valley (or diverge entirely).
- If α is small enough to keep the w2 updates stable, progress along w1 becomes very slow, and training takes far more iterations than necessary.
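To make this concrete, here is a minimal NumPy sketch of that trade-off. The quadratic loss f(w1, w2) = 0.5 * (w1^2 + 50 * w2^2), the starting point, and both learning rates are illustrative assumptions, not values from any real model:

```python
import numpy as np

# Hypothetical elongated quadratic loss (illustrative only):
#   f(w1, w2) = 0.5 * (w1**2 + 50 * w2**2)
# Its gradient is (w1, 50 * w2): shallow along w1, steep along w2.
def grad(w):
    return np.array([w[0], 50.0 * w[1]])

def run_sgd(alpha, steps=20, w0=(-5.0, 1.0)):
    """Plain gradient descent with a single global learning rate alpha."""
    w = np.array(w0)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# A learning rate small enough to keep w2 stable barely moves w1:
print(run_sgd(alpha=0.01))  # w2 shrinks quickly, but w1 is still near -4
# A learning rate large enough to move w1 faster makes w2 oscillate and diverge:
print(run_sgd(alpha=0.05))  # |1 - 0.05 * 50| = 1.5 > 1, so w2 blows up
```

Neither choice of α is satisfactory: one coordinate always pays the price for the other.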
SGD with Momentum or NAG can help smooth out oscillations and accelerate progress along shallow directions, but they still operate with that single learning rate, limiting their effectiveness on highly non-spherical surfaces.
Ideally, we want an optimization algorithm that can adapt its step size for each parameter independently. It should:

- take larger steps for parameters whose gradients are consistently small or infrequent (like those along the shallow w1 direction), so they are not left behind, and
- take smaller, more cautious steps for parameters whose gradients are consistently large (like those along the steep w2 direction), so the updates do not overshoot and oscillate.
This is precisely the motivation behind adaptive learning rate algorithms. Methods like AdaGrad, RMSprop, and Adam maintain information about the past gradients for each parameter and use this history to scale the learning rate individually. They effectively give each parameter its own learning rate that changes dynamically during training.
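As a rough sketch of this idea, the snippet below scales each parameter's step by a running average of its past squared gradients (an RMSprop-style rule, covered in detail in a later section). The hyperparameter values and the toy loss reused from the earlier example are assumptions chosen purely for illustration:

```python
import numpy as np

# Toy elongated quadratic from before: f(w1, w2) = 0.5 * (w1**2 + 50 * w2**2)
def grad(w):
    return np.array([w[0], 50.0 * w[1]])

# Illustrative hyperparameters (assumed, not prescriptive)
alpha, beta, eps = 0.01, 0.9, 1e-8

w = np.array([-5.0, 1.0])   # parameters
s = np.zeros_like(w)        # running average of squared gradients, one entry per parameter

for _ in range(1000):
    g = grad(w)
    s = beta * s + (1 - beta) * g**2         # accumulate per-parameter gradient history
    w = w - alpha * g / (np.sqrt(s) + eps)   # effective step size differs per parameter

# Both coordinates are driven close to 0 (up to small oscillations on the
# order of alpha), despite their very different curvatures.
print(w)
```

The key point is the division by the square root of the accumulated history: parameters with large, frequent gradients get their steps shrunk, while parameters with small gradients keep taking comparatively large steps.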
By adjusting the learning rate on a per-parameter basis, these algorithms can often navigate complex loss surfaces more effectively than standard SGD or its momentum variants, converging faster and sometimes finding better solutions. They also tend to perform reasonably well at their default hyperparameter settings, requiring less manual tuning of the learning rate.
In the following sections, we will look into the specific mechanisms of AdaGrad, RMSprop, and Adam to understand how they achieve this adaptive behavior.