Understanding the theory behind gradient descent is one thing; seeing it in action provides a much more intuitive grasp of how it navigates the loss landscape to find optimal parameters. In this practical section, we'll use simple visualizations to illustrate the behavior of the gradient descent algorithm.
We'll start with a very simple, one-dimensional convex function, like $f(x) = x^2$. Think of this as a simplified loss function where we only have one parameter, $x$, to optimize. Our goal is to find the value of $x$ that minimizes $f(x)$. We know from calculus that the minimum is at $x = 0$. Let's see how gradient descent finds it.
The gradient (the derivative in this 1D case) of $f(x) = x^2$ is $f'(x) = 2x$. The gradient descent update rule is:

$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot f'(x_{\text{old}})$$

$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot (2x_{\text{old}})$$

where $\alpha$ is the learning rate.
Let's simulate this process. We'll start at an arbitrary point, say $x_0 = 4$, and choose a learning rate, say $\alpha = 0.1$.
- Step 0: $x_0 = 4.0$. $f(x_0) = 16.0$. Gradient $f'(x_0) = 2 \times 4.0 = 8.0$.
- Step 1: $x_1 = x_0 - \alpha \cdot f'(x_0) = 4.0 - 0.1 \times 8.0 = 4.0 - 0.8 = 3.2$. $f(x_1) = 10.24$. Gradient $f'(x_1) = 2 \times 3.2 = 6.4$.
- Step 2: $x_2 = x_1 - \alpha \cdot f'(x_1) = 3.2 - 0.1 \times 6.4 = 3.2 - 0.64 = 2.56$. $f(x_2) = 6.55$. Gradient $f'(x_2) = 2 \times 2.56 = 5.12$.
- Step 3: $x_3 = x_2 - \alpha \cdot f'(x_2) = 2.56 - 0.1 \times 5.12 = 2.56 - 0.512 = 2.048$. $f(x_3) = 4.19$.
- ... and so on. Notice how the value of $x$ gets progressively closer to $0$, and how the size of the steps (determined by the gradient) also decreases as we approach the minimum.
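The iteration above can be reproduced with a minimal sketch (the function name `gradient_descent_1d` is just an illustrative choice):

```python
def gradient_descent_1d(x0, lr, steps):
    """Minimize f(x) = x^2 by gradient descent; the derivative is f'(x) = 2x."""
    history = [x0]
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # update rule: x_new = x_old - alpha * f'(x_old)
        history.append(x)
    return history

path = gradient_descent_1d(x0=4.0, lr=0.1, steps=3)
print([round(v, 3) for v in path])  # [4.0, 3.2, 2.56, 2.048]
```

Running it reproduces the hand-computed sequence above, which is a useful sanity check before moving to higher dimensions.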
Let's visualize this path.
Gradient descent iteratively steps towards the minimum of the quadratic function $f(x) = x^2$, starting from $x = 4$ with a learning rate $\alpha = 0.1$.
The Effect of the Learning Rate
The learning rate $\alpha$ is a critical hyperparameter. Let's see what happens with different values, still starting at $x_0 = 4$.
- Small Learning Rate ($\alpha = 0.01$): The steps are tiny, leading to very slow progress towards the minimum.
- Large Learning Rate ($\alpha = 0.95$): The steps are too large, causing the algorithm to overshoot the minimum and oscillate back and forth across it. For $f(x) = x^2$, each update multiplies $x$ by $1 - 2\alpha = -0.9$, so it still converges, but inefficiently.
- Too Large Learning Rate ($\alpha = 1.05$): Here each update multiplies $x$ by $-1.1$, so every step lands further from the minimum than the previous point. The algorithm diverges.
Comparison of gradient descent paths for different learning rates ($\alpha$) on $f(x) = x^2$. Small values converge slowly, appropriate values converge efficiently, and larger values can overshoot or even diverge.
This highlights why selecting an appropriate learning rate is so significant for successful training.
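These regimes are easy to confirm numerically. A hedged sketch, reusing the same $f(x) = x^2$ setup (the helper name `run_gd` is an assumption for illustration):

```python
def run_gd(lr, x0=4.0, steps=20):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of f(x) = x^2 is 2x
    return x

# Closed form: after n steps, x_n = x0 * (1 - 2*lr)^n.
for lr in (0.01, 0.1, 0.95, 1.05):
    print(f"alpha={lr}: x after 20 steps = {run_gd(lr):.4f}")
```

With $\alpha = 0.01$ the final $x$ is still far from zero, $\alpha = 0.1$ converges quickly, $\alpha = 0.95$ oscillates but shrinks, and $\alpha = 1.05$ blows up past the starting point.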
Visualizing in Two Dimensions
Neural network loss functions typically depend on millions of parameters, making their landscapes impossible to visualize directly. However, we can get a better intuition by visualizing gradient descent on a function of two variables, say $f(x, y) = x^2 + y^2$. The minimum is clearly at $(0, 0)$.
The gradient is now a vector:

$$\nabla f(x, y) = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] = [2x,\ 2y]$$

The update rule becomes:

$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot (2x_{\text{old}})$$

$$y_{\text{new}} = y_{\text{old}} - \alpha \cdot (2y_{\text{old}})$$
Let's start at $(x_0, y_0) = (3, 4)$ with $\alpha = 0.1$.
- Step 0: $(x_0, y_0) = (3.0, 4.0)$. $f(x_0, y_0) = 9 + 16 = 25$. Gradient $\nabla f(3, 4) = [6, 8]$.
- Step 1:
  - $x_1 = 3.0 - 0.1 \times 6 = 3.0 - 0.6 = 2.4$.
  - $y_1 = 4.0 - 0.1 \times 8 = 4.0 - 0.8 = 3.2$.
  - $(x_1, y_1) = (2.4, 3.2)$. $f(x_1, y_1) = 5.76 + 10.24 = 16.0$. Gradient $\nabla f(2.4, 3.2) = [4.8, 6.4]$.
- Step 2:
  - $x_2 = 2.4 - 0.1 \times 4.8 = 2.4 - 0.48 = 1.92$.
  - $y_2 = 3.2 - 0.1 \times 6.4 = 3.2 - 0.64 = 2.56$.
  - $(x_2, y_2) = (1.92, 2.56)$. $f(x_2, y_2) = 3.6864 + 6.5536 = 10.24$.
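The 2D version generalizes the 1D loop by applying the update to each coordinate independently. A minimal sketch (the name `gradient_descent_2d` is illustrative):

```python
def gradient_descent_2d(start, lr, steps):
    """Minimize f(x, y) = x^2 + y^2; the gradient is [2x, 2y]."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        # Each coordinate follows x_new = x_old - alpha * (2 * x_old).
        x, y = x - lr * 2 * x, y - lr * 2 * y
        path.append((x, y))
    return path

path = gradient_descent_2d((3.0, 4.0), lr=0.1, steps=2)
for px, py in path:
    print(f"({px:.2f}, {py:.2f})  f = {px**2 + py**2:.4f}")
```

This prints the same three points computed by hand above: $(3, 4)$, $(2.4, 3.2)$, and $(1.92, 2.56)$.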
We can visualize this as a path on a contour plot of the function f(x,y). The gradient at any point is perpendicular to the contour line passing through that point and points in the direction of steepest ascent. Gradient descent takes steps in the opposite direction (steepest descent).
Gradient descent path on the contour plot of $f(x, y) = x^2 + y^2$. Starting from $(3, 4)$, the algorithm takes steps perpendicular to the contour lines towards the minimum at $(0, 0)$.
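The perpendicularity claim can be checked directly: on this bowl, the contour through any point is a circle, whose tangent direction at $(x, y)$ is $(-y, x)$. A quick numerical check at our starting point:

```python
# At a point on a contour of f(x, y) = x^2 + y^2 (a circle), the tangent
# direction is (-y, x) and the gradient is (2x, 2y). A zero dot product
# confirms the gradient is perpendicular to the contour line.
x, y = 3.0, 4.0
grad = (2 * x, 2 * y)   # (6.0, 8.0)
tangent = (-y, x)       # direction along the contour circle
dot = grad[0] * tangent[0] + grad[1] * tangent[1]
print(dot)  # 0.0
```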
These visualizations, although simple, illustrate the core mechanism of gradient descent. In deep learning, the "landscape" is vastly more complex and high-dimensional, with potentially many local minima and saddle points (as discussed in gradient-descent-challenges). However, the fundamental idea remains the same: follow the negative gradient to iteratively reduce the loss. Keeping this visual intuition in mind is helpful when diagnosing training issues or tuning hyperparameters like the learning rate.