In the previous section, we introduced Gradient Descent as the engine driving the learning process in neural networks. It iteratively adjusts the model's weights and biases to minimize the loss function. But how much should we adjust the weights in each step? This is controlled by a critical hyperparameter: the learning rate.
Think of gradient descent as trying to find the lowest point in a valley (the minimum of the loss function). The gradient tells you the direction of the steepest ascent, so you move in the opposite direction to go downhill. The learning rate determines the size of the step you take in that downhill direction.
Mathematically, the learning rate, often denoted by the Greek letter alpha (α) or eta (η), is a small positive scalar value. It multiplies the gradient before it's subtracted from the current weight. The update rule for a single weight w looks like this:
w_new = w_current − α · (∂L/∂w_current)

Here, ∂L/∂w_current is the gradient of the loss function L with respect to the current weight w_current. The learning rate α scales this gradient.
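As a concrete sketch, the update rule can be written as a one-line function. The loss L(w) = w², its gradient 2w, and the starting values here are illustrative choices, not taken from the text:

```python
def update_weight(w_current, grad, alpha):
    """One gradient descent step: w_new = w_current - alpha * grad."""
    return w_current - alpha * grad

# Illustrative example: loss L(w) = w^2, so dL/dw = 2w.
w = 5.0
alpha = 0.1
grad = 2 * w                      # gradient of L = w^2 at w = 5.0 is 10.0
w = update_weight(w, grad, alpha)
print(w)                          # 5.0 - 0.1 * 10.0 = 4.0
```

Each training step repeats this update for every weight, using that weight's own partial derivative of the loss.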
Choosing the right learning rate is critical for successful model training. The value of α directly influences both the speed of convergence and whether the algorithm converges at all.
Learning Rate Too Small: If α is very small, the weight updates will be tiny. The algorithm will take extremely small steps downhill. While this might eventually lead to a good minimum, the training process will be incredibly slow, potentially requiring a vast number of epochs (passes through the training data) to reach convergence. In complex loss landscapes, it might also increase the risk of getting stuck in a poor local minimum because the steps aren't large enough to jump out of small dips.
Learning Rate Too Large: If α is too large, the steps taken might be too big. Instead of descending smoothly into the minimum, the algorithm might overshoot it entirely. It could bounce back and forth across the valley, failing to settle near the minimum. In the worst case, the steps are so large that the loss actually increases with each update, and the algorithm diverges completely. The weights can explode to extremely large values, eventually overflowing to infinity or NaN, and the training process fails.
Consider a simple parabolic loss function. The plot below illustrates how different learning rates affect the path gradient descent takes towards the minimum.
Example paths for gradient descent on a simple loss function L=w2 starting at w=5. A suitable learning rate converges steadily. A small learning rate makes slow progress. A large learning rate overshoots and oscillates, potentially diverging.
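The same behaviors can be reproduced numerically. This sketch runs gradient descent on L(w) = w² (gradient 2w) from w = 5; the three learning rate values chosen here are illustrative:

```python
def run_gd(alpha, w0=5.0, steps=10):
    """Run gradient descent on L(w) = w^2 and return the trajectory of w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w   # update: w_new = w - alpha * dL/dw
        history.append(w)
    return history

for alpha in (0.01, 0.3, 1.1):
    final_w = run_gd(alpha)[-1]
    print(f"alpha={alpha}: final w = {final_w:.4f}")
```

With α = 0.01 progress is slow (w barely moves from 5 after 10 steps), α = 0.3 converges rapidly toward the minimum at w = 0, and α = 1.1 makes w oscillate in sign with growing magnitude, diverging away from the minimum.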
Selecting an appropriate learning rate is often more of an art than a science initially and typically involves experimentation. Common starting values are often in the range of 0.1 to 0.0001, but the optimal value depends heavily on factors such as the model architecture, the dataset, the batch size, and the optimizer being used.
Techniques like learning rate scheduling (gradually decreasing the learning rate during training) and adaptive learning rate methods (like Adam or RMSprop, covered in Chapter 4) have been developed to help manage this parameter more effectively. However, understanding the fundamental role of the base learning rate is essential before moving to these more advanced techniques. You'll often need to tune this hyperparameter as part of the model development process, monitoring the training loss to see if it's decreasing steadily at a reasonable pace.
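As a minimal sketch of learning rate scheduling, here is a simple step decay that halves the rate at fixed intervals. The decay factor and interval are illustrative assumptions, not values from the text:

```python
def decayed_lr(initial_lr, epoch, decay_rate=0.5, decay_every=10):
    """Multiply the learning rate by decay_rate every decay_every epochs."""
    return initial_lr * (decay_rate ** (epoch // decay_every))

print(decayed_lr(0.1, 0))    # 0.1   (no decay yet)
print(decayed_lr(0.1, 10))   # 0.05  (one decay applied)
print(decayed_lr(0.1, 25))   # 0.025 (two decays applied)
```

Schedules like this let training take large steps early, when the weights are far from a minimum, and smaller, more careful steps later as the loss approaches it.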
© 2025 ApX Machine Learning