Okay, let's refine our understanding of the gradient descent step. We've seen that we calculate the gradient of the cost function, which tells us the direction of the steepest increase in cost. Since our goal is to minimize the cost, we want to move in the opposite direction of the gradient.
But how far do we move in that opposite direction during each step? This is where the learning rate comes in. It's a small positive number, often denoted by the Greek letter alpha (α), that scales the size of the step we take.
Think back to our update rule for a parameter (like m or b in our linear regression example). The general form looks like this:
parameter = parameter - learning_rate * gradient_of_cost_wrt_parameter
For our specific parameters m and b, using the partial derivative notation for the gradient components, the updates are:
b = b − α ∂J/∂b
m = m − α ∂J/∂m
Here, J represents the cost function, ∂J/∂b is the partial derivative of the cost with respect to b, ∂J/∂m is the partial derivative of the cost with respect to m, and α is our learning rate.
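To make the update rule concrete, here is a minimal sketch of one gradient descent step for simple linear regression with a mean squared error cost. The function name and the pure-Python list handling are illustrative choices, not a fixed API:

```python
def gradient_step(m, b, x, y, alpha):
    """Perform one gradient descent update on slope m and intercept b.

    Assumes the cost J = (1/n) * sum((m*x_i + b - y_i)**2).
    """
    n = len(x)
    # Residuals between predictions and targets
    errors = [(m * xi + b) - yi for xi, yi in zip(x, y)]
    # Partial derivatives of J with respect to m and b
    dJ_dm = (2 / n) * sum(e * xi for e, xi in zip(errors, x))
    dJ_db = (2 / n) * sum(errors)
    # Move opposite to the gradient, scaled by the learning rate
    return m - alpha * dJ_dm, b - alpha * dJ_db
```

Calling this repeatedly on data generated by a line such as y = 2x + 1 drives m toward 2 and b toward 1, provided α is small enough.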
The gradient tells us the direction to move, but not the distance. The magnitude (size) of the gradient vector indicates how steep the slope is. If we simply subtracted the full gradient from our parameters in each step, we might take huge steps when the slope is steep and tiny steps when it's flat. Taking giant steps, especially when we are far from the minimum, can cause problems.
Imagine you are hiking down a mountain in foggy conditions. You can feel the slope beneath your feet (the gradient), telling you the steepest way down. The learning rate is like deciding the length of your stride.
Choosing the right learning rate (α) is important for effective optimization. Let's consider what happens with different choices:
If the learning rate is too small: You take tiny steps downhill. You will eventually reach the bottom (the minimum cost), but it might take a very long time and many iterations. Progress will be slow.
If the learning rate is too large: You take giant strides downhill. You might overshoot the bottom of the valley and end up on the other side, potentially even higher up than where you started! The cost might bounce around erratically and fail to decrease, or it might even increase over time (divergence).
If the learning rate is "just right": You take reasonably sized steps, making steady progress towards the minimum without overshooting too much. This usually leads to finding a good solution efficiently.
Let's visualize this. Imagine a simple cost function that depends on just one parameter, plotted against the parameter's value. We want to find the lowest point. The following chart simulates how the cost might change over iterations for different learning rates.
Simulation of the cost function J(w) = w², starting at w = 5, using the gradient descent update w = w − α(2w). A good learning rate (green) steadily decreases the cost. A small rate (yellow) decreases the cost very slowly. A large rate (orange) might oscillate without improvement. An even larger rate (red) can cause the cost to increase (diverge).
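The chart's setup is easy to reproduce in a few lines. The sketch below follows the same recipe, J(w) = w² with gradient 2w starting at w = 5; the specific α values are illustrative picks for each regime, not taken from the chart:

```python
def run(alpha, steps=20, w=5.0):
    """Record the cost J(w) = w**2 after each gradient descent step."""
    costs = []
    for _ in range(steps):
        w = w - alpha * (2 * w)  # update: move against the gradient 2w
        costs.append(w ** 2)     # cost after this step
    return costs

good  = run(0.1)   # w shrinks by a factor (1 - 0.2) each step: steady decrease
small = run(0.01)  # factor 0.98: progress, but very slow
large = run(1.0)   # factor -1: w flips sign each step, cost never improves
huge  = run(1.05)  # factor -1.1: |w| grows each step, the cost diverges
```

Because the update is w ← (1 − 2α)w, the whole behavior is governed by the multiplier (1 − 2α): convergence when its magnitude is below 1, oscillation at exactly −1, divergence beyond.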
So, how do you pick the learning rate? Finding the optimal learning rate often involves some experimentation. Common starting values might be 0.1, 0.01, 0.001, or 0.0001. You might try a few different values and see which one causes the cost function to decrease steadily and reasonably quickly during the initial training iterations. There are also more advanced techniques to automatically adjust the learning rate during training, but for now, understand that it's a parameter you typically set before starting the optimization process.
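That trial-and-error process can itself be written as a small loop: run each candidate rate for a handful of iterations and keep whichever ends with the lowest cost. The quadratic cost J(w) = (w − 3)² below is a hypothetical stand-in for a real model's cost:

```python
def final_cost(alpha, steps=50, w=0.0):
    """Run gradient descent on J(w) = (w - 3)**2 and return the final cost."""
    for _ in range(steps):
        w -= alpha * 2 * (w - 3)  # gradient of (w - 3)**2 is 2(w - 3)
    return (w - 3) ** 2

# Candidate rates from the common starting values mentioned above
candidates = [0.1, 0.01, 0.001, 0.0001]
best = min(candidates, key=final_cost)
```

In practice you would compare cost curves over the first few training iterations rather than a single final number, but the idea is the same: the learning rate is chosen empirically before (or early in) training.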
In summary, the learning rate α is a small but significant number that controls how big a step gradient descent takes at each iteration. It requires careful selection: too small leads to slow convergence, while too large can cause instability and divergence. Getting it right helps the algorithm efficiently find the parameter values that minimize the cost function.
© 2025 ApX Machine Learning