In the previous section, we saw how a cost function, like Mean Squared Error (MSE), measures how well our current line fits the data. A lower cost means a better fit. Our goal now is to find the specific values for the slope (m) and the y-intercept (b) that result in the lowest possible cost. But how do we systematically find these best values? This is where an algorithm called Gradient Descent comes in.
Think of the cost function as a landscape, perhaps a valley or a bowl shape. The value of the cost function (our error) corresponds to the altitude at any given point in this landscape. The coordinates in this landscape are the values of our parameters, m and b. Our goal is to find the lowest point in this valley, the point where the cost is minimal.
Gradient descent is an iterative algorithm that helps us "walk downhill" on this cost function landscape until we reach the bottom, or at least a very low point.
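To make this landscape concrete, here is a minimal sketch (assuming NumPy and the MSE cost from the previous section, J(m,b) = (1/n) Σ (yᵢ − (m xᵢ + b))²; some texts use a 1/(2n) factor instead, which only rescales the surface) of a function that reports the "altitude" at any point (m, b). The sample data is illustrative:

```python
import numpy as np

def compute_cost(m, b, x, y):
    """Mean Squared Error cost J(m, b) for the line y_hat = m*x + b."""
    y_hat = m * x + b                 # predictions of the current line
    return np.mean((y - y_hat) ** 2)  # average squared error: the "altitude"

# Evaluating the cost at different (m, b) points probes the landscape:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
print(compute_cost(0.0, 0.0, x, y))  # a high point on the landscape
print(compute_cost(2.0, 0.0, x, y))  # much lower: a better-fitting line
```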
Imagine you are standing somewhere on a foggy hillside, and your goal is to reach the bottom of the valley. You can't see the whole landscape, but you can feel the steepness and direction of the slope right where you are standing. A sensible strategy is to take a small step in the steepest downhill direction.
You keep repeating this process. Each step takes you to a slightly lower point on the hillside. Eventually, if you keep taking small steps downhill, you'll end up at the bottom of the valley, the point of minimum altitude (minimum cost).
In the context of linear regression, gradient descent works by:
1. Initialization: Starting with some initial guesses for m and b. These could be anything, often just 0.
2. Calculate the Gradient: Calculating the gradient of the cost function J(m,b) at the current values of m and b. The gradient is a pair of values (the partial derivatives of J with respect to m and b) that tells us how much the cost function changes if we slightly change m or b. Specifically, it points in the direction of the steepest increase in cost.
3. Update Parameters: Adjusting m and b by moving a small amount in the opposite direction of the gradient. This means subtracting a fraction of the gradient from the current parameter values. The size of this "step" is controlled by a parameter called the learning rate (often denoted by α, the Greek letter alpha).
The update rules look like this:
$$m := m - \alpha \frac{\partial J}{\partial m}$$
$$b := b - \alpha \frac{\partial J}{\partial b}$$
The := symbol means "is updated to". We simultaneously update both m and b, calculating both gradients using the current values of m and b before applying either update.
4. Iteration: Repeating steps 2 and 3 many times. With each iteration, the values of m and b should move closer to the values that minimize the cost function J(m,b), and the cost itself should decrease. A code sketch of this full loop appears below.
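Here is a minimal sketch of these four steps in Python (NumPy assumed). The gradient expressions come from differentiating the MSE cost J(m,b) = (1/n) Σ (yᵢ − (m xᵢ + b))²; if your MSE uses a 1/(2n) factor, the 2s below disappear. The function name and data are illustrative:

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, n_iterations=1000):
    m, b = 0.0, 0.0                      # Step 1: initialize parameters

    n = len(x)
    for _ in range(n_iterations):        # Step 4: iterate
        y_hat = m * x + b
        error = y - y_hat

        # Step 2: gradient of J(m, b) = (1/n) * sum((y - (m*x + b))**2)
        dJ_dm = -(2.0 / n) * np.sum(x * error)
        dJ_db = -(2.0 / n) * np.sum(error)

        # Step 3: move opposite the gradient, scaled by the learning rate.
        # Both gradients were computed from the same current (m, b).
        m = m - learning_rate * dJ_dm
        b = b - learning_rate * dJ_db

    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
m, b = gradient_descent(x, y)
print(f"m = {m:.3f}, b = {b:.3f}")  # should approach the best-fit line
```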
The learning rate, α, is a small positive number (e.g., 0.01, 0.001) that controls how big a step we take downhill in each iteration. It's a critical parameter:

- If α is too small, each step barely moves the parameters, and convergence can take a very long time.
- If α is too large, we can overshoot the minimum; the cost may oscillate or even diverge, growing with each iteration.
Choosing a good learning rate often involves some experimentation. We want it large enough for reasonably fast convergence but small enough to avoid overshooting or divergence.
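As an illustration of that experimentation, this sketch runs the same descent loop with three learning rates on illustrative data and compares the resulting cost:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
n = len(x)

for lr in [0.0001, 0.01, 0.2]:
    m, b = 0.0, 0.0
    for _ in range(100):
        error = y - (m * x + b)
        # Both gradients use the same pre-update (m, b) via `error`.
        m = m - lr * (-(2.0 / n) * np.sum(x * error))
        b = b - lr * (-(2.0 / n) * np.sum(error))
    cost = np.mean((y - (m * x + b)) ** 2)
    print(f"alpha={lr}: cost after 100 iterations = {cost:.3g}")

# Typical outcome on this data: alpha=0.0001 barely reduces the cost
# (too slow), alpha=0.2 diverges (cost explodes), while alpha=0.01
# makes steady progress toward the minimum.
```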
Let's visualize the cost function J(m,b) as a contour plot. Each contour line represents points (m,b) where the cost is the same. The center of the contours represents the minimum cost. Gradient descent starts at some point (m₀, b₀) and takes steps perpendicular to the contour lines, moving towards the center.
Figure: A contour plot of the cost function over different values of slope (m) and intercept (b). The red line illustrates the path gradient descent might take, starting from an initial guess and iteratively moving towards the minimum cost point at the center.
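If you want to reproduce a plot like this, here is an illustrative Matplotlib sketch; the data, grid ranges, starting point, and learning rate are all assumptions for demonstration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

def cost(m, b):
    return np.mean((y - (m * x + b)) ** 2)

# Evaluate J(m, b) on a grid of (m, b) values for the contour plot.
ms = np.linspace(-1, 5, 100)
bs = np.linspace(-4, 4, 100)
M, B = np.meshgrid(ms, bs)
J = np.array([[cost(mv, bv) for mv in ms] for bv in bs])

# Record the path gradient descent takes from an initial guess.
m, b, lr, path = -0.5, -3.0, 0.05, []
for _ in range(60):
    path.append((m, b))
    error = y - (m * x + b)
    grad_m = -(2.0 / len(x)) * np.sum(x * error)
    grad_b = -(2.0 / len(x)) * np.sum(error)
    m, b = m - lr * grad_m, b - lr * grad_b

plt.contour(M, B, J, levels=30)
plt.plot(*zip(*path), "r.-")   # the red descent path
plt.xlabel("m (slope)")
plt.ylabel("b (intercept)")
plt.title("Gradient descent on the cost surface J(m, b)")
plt.show()
```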
We typically stop the iteration process when one of these conditions is met:

- A preset maximum number of iterations has been reached.
- The decrease in cost between successive iterations falls below a small tolerance, indicating convergence.
- The magnitude of the gradient is very close to zero, meaning we are at or very near a minimum.
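As a sketch of how the first two checks might look in code (the tolerance value and data are illustrative, under the same MSE assumptions as before):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
n = len(x)

m, b = 0.0, 0.0
lr, tol, max_iters = 0.05, 1e-9, 100_000   # illustrative settings
prev_cost = np.mean((y - (m * x + b)) ** 2)

for i in range(max_iters):                  # cap on total iterations
    error = y - (m * x + b)
    grad_m = -(2.0 / n) * np.sum(x * error)
    grad_b = -(2.0 / n) * np.sum(error)
    m, b = m - lr * grad_m, b - lr * grad_b
    cost = np.mean((y - (m * x + b)) ** 2)
    if abs(prev_cost - cost) < tol:         # cost change has flattened out
        print(f"Stopped after {i + 1} iterations: m={m:.3f}, b={b:.3f}")
        break
    prev_cost = cost
```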
Gradient descent is the optimization engine that allows our linear regression model to "learn" from the data. By iteratively adjusting the slope (m) and intercept (b) to minimize the cost function, it finds the line that represents the best fit for our training data according to the MSE metric.