Standard gradient descent methods, which we revisited in Chapter 1, typically rely on a single hyperparameter: the learning rate, often denoted as η. This value scales the gradient ∇J(θ) to determine the size of the step we take in the parameter space at each iteration: $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$. While simple, choosing an effective fixed η is a significant practical challenge.
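To make the update rule concrete, here is a minimal sketch of fixed-rate gradient descent on a simple quadratic loss; the loss function, starting point, and learning rate are illustrative choices, not values from any particular model.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the illustrative loss J(theta) = 0.5 * ||theta||^2,
    # whose minimum sits at theta = 0.
    return theta

eta = 0.1                       # fixed learning rate
theta = np.array([2.0, -3.0])   # arbitrary starting point

for t in range(100):
    # theta_{t+1} = theta_t - eta * grad J(theta_t)
    theta = theta - eta * grad_J(theta)

print(theta)  # ends up very close to the minimum at [0, 0]
```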
The core issue lies in the sensitivity of the training process to this single value. Set η too small, and the optimizer takes minuscule steps. Convergence becomes painfully slow, potentially requiring an impractical number of iterations to reach a satisfactory minimum. Imagine trying to descend a mountain taking only pebble-sized steps: you will get there eventually, but it will take a very long time, and you may get stuck on relatively flat areas (plateaus) that are still far from the true valley floor.
Conversely, setting η too large introduces instability. The optimizer might overshoot the minimum entirely, bouncing back and forth across the optimal region without settling. In more severe cases, the updates can be so large that the loss actually increases from one iteration to the next, leading to divergence. Think of leaping down the mountain: you might jump right over the lowest point or even launch yourself off a cliff.
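Both failure modes can be seen on a one-dimensional quadratic, J(θ) = ½aθ², where the fixed-rate update becomes θ ← (1 − ηa)θ and diverges once η > 2/a. The values below are illustrative.

```python
def run(eta, a=1.0, theta=5.0, steps=50):
    # Repeatedly apply the fixed-rate update theta <- theta - eta * a * theta.
    for _ in range(steps):
        theta = theta - eta * a * theta
    return theta

print(run(eta=0.001))  # ~4.76: tiny steps barely move after 50 iterations
print(run(eta=0.5))    # ~4e-15: a well-chosen rate converges quickly
print(run(eta=2.5))    # ~3.2e9: too large, the iterate (and the loss) blows up
```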
Figure: Convergence behavior for different fixed learning rates on a hypothetical convex problem. Small rates converge slowly; large rates can oscillate or diverge.
Finding the "Goldilocks" rate often requires considerable manual tuning and experimentation, which can be computationally expensive. Furthermore, the optimal fixed learning rate itself might not even exist. The characteristics of the loss landscape often change during training. Early on, when parameters are far from optimal, larger steps might be beneficial to make rapid progress. However, as training nears a minimum, smaller, more careful steps are usually needed to avoid overshooting. A single η cannot effectively cater to both regimes.
This "one size fits all" approach also ignores the potential benefits of parameter specific adjustments. Consider a model processing text data with sparse features. Parameters associated with common features might receive frequent, possibly noisy gradient updates. Parameters for rare features receive infrequent but potentially very informative updates when they occur. It could be advantageous to use smaller updates for the frequently updated parameters (to average out noise and prevent instability) and larger updates for the sparsely updated ones (to ensure they learn effectively from the limited data they see). A global, fixed η treats all parameters identically, regardless of their update frequency or the scale of their gradients, which can hinder efficient learning.
Finally, the complex, non-convex nature of loss surfaces in deep learning, which we discussed in Chapter 1, exacerbates these issues. Fixed learning rates can struggle with saddle points (where the gradient is near zero but the point is not a minimum) and extensive plateaus (flat regions). A small η can cause the optimizer to crawl across these features at an impractically slow pace. A large η might help escape plateaus faster, but it increases the risk of oscillation or divergence, especially when navigating narrow valleys or approaching sharp minima.
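To see why a saddle point stalls a fixed-rate optimizer, consider the illustrative surface J(x, y) = x² − y²: near the origin the gradient is almost zero, so any fixed η produces vanishingly small steps even though the point is not a minimum.

```python
import numpy as np

def grad_J(p):
    # Gradient of J(x, y) = x**2 - y**2, which has a saddle point at the origin.
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1e-3, 1e-3])   # a point very close to the saddle
eta = 0.1
print(grad_J(p))             # ~[0.002, -0.002]: nearly zero gradient
print(eta * grad_J(p))       # the resulting step is tiny for any reasonable fixed eta
```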
These inherent difficulties with selecting and applying a single, fixed learning rate motivate the need for more sophisticated approaches. The algorithms discussed in the following sections (AdaGrad, RMSprop, Adam, and their relatives) directly address these limitations by dynamically adapting the learning rate during training, often on a per-parameter basis. This adaptability generally leads to faster convergence across a wider range of problems and architectures, often with less manual tuning required.