In the previous chapter, we established that the gradient, ∇J(θ), points in the direction where the cost function J(θ) increases most rapidly. However, our goal in machine learning optimization is typically the opposite: we want to minimize the cost function to find the model parameters θ that give the best performance. So, how can the gradient help us go downhill instead of uphill?
Imagine you are standing on a hillside, surrounded by thick fog. You can't see the valley floor (the minimum), but you can feel the slope of the ground beneath your feet. To get down the hill, the most intuitive approach is to identify the direction of the steepest descent and take a step in that direction. You repeat this process, constantly checking the slope and taking steps downward, hoping to eventually reach the bottom.
Gradient descent works on a very similar principle. The "slope" we can measure at any point θ in our parameter space is given by the gradient ∇J(θ). Since the gradient points uphill (steepest ascent), the direction of steepest descent is simply the opposite direction: −∇J(θ).
Therefore, the core idea of gradient descent is to start with some initial guess for the parameters θ and iteratively update them by taking small steps in the direction of the negative gradient: θ ← θ − η∇J(θ), where η is a small positive step size. Each step should, ideally, move us closer to a point where the cost function J(θ) is minimal.
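This iterative update can be sketched in a few lines of Python. The toy cost J(θ) = (θ − 3)², the starting point, the step size, and the iteration count below are all illustrative choices, not values from the text:

```python
# Gradient descent on a toy 1D cost, J(theta) = (theta - 3)**2,
# whose minimum sits at theta = 3.

def cost(theta):
    return (theta - 3.0) ** 2

def gradient(theta):
    return 2.0 * (theta - 3.0)  # dJ/dtheta

theta = 0.0          # initial guess
step_size = 0.1      # illustrative step size (the "learning rate", discussed below)

for _ in range(100):
    theta = theta - step_size * gradient(theta)  # step in the -gradient direction

print(round(theta, 4))  # approaches the minimum at theta = 3
```

Each pass through the loop shrinks the distance to the minimum by a constant factor, so the parameter converges toward θ = 3 without ever being told where the minimum is.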
Consider a simple cost function that depends on two parameters, θ1 and θ2. We can visualize this function as a surface or, more commonly, using contour lines, where each line represents points of equal cost.
A contour plot showing level sets of a cost function J(θ1,θ2). The red line illustrates steps taken by gradient descent, starting from an initial point and moving towards the minimum (green 'x'), always perpendicular to the contour lines (in the direction of the negative gradient).
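The same procedure in two dimensions can be sketched as follows. The cost J(θ1, θ2) = θ1² + 2θ2² and the starting point are illustrative; the partial derivatives together form the gradient, and each update moves against both of them at once:

```python
# Gradient descent on an illustrative 2D cost, J(theta1, theta2) = theta1**2 + 2*theta2**2,
# whose minimum is at (0, 0).

def grad(theta1, theta2):
    # Partial derivatives of J with respect to theta1 and theta2
    return 2.0 * theta1, 4.0 * theta2

theta1, theta2 = 4.0, 3.0  # arbitrary starting point on the "hillside"
step_size = 0.1

for _ in range(50):
    g1, g2 = grad(theta1, theta2)
    theta1 -= step_size * g1  # move opposite each partial derivative
    theta2 -= step_size * g2

print(theta1, theta2)  # both coordinates approach the minimum at (0, 0)
```

Recording (theta1, theta2) at each iteration and plotting the points would trace out a path like the red line in the contour plot, always crossing the level sets at right angles.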
Each step involves two parts: first, computing the gradient ∇J(θ) at the current parameter values; second, updating the parameters by moving a small distance in the opposite direction, −∇J(θ).
The size of the step taken in each iteration is a significant factor. If the steps are too large, we might overshoot the minimum and bounce around erratically, potentially even diverging. If the steps are too small, the algorithm might take a very long time to reach the minimum. This step size is controlled by a parameter called the learning rate, which we will examine in detail shortly.
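The effect of the step size can be demonstrated on a toy cost, J(θ) = θ², whose gradient is 2θ and whose minimum is at 0. The three rates below are illustrative choices:

```python
# Compare three learning rates on J(theta) = theta**2 (gradient: 2*theta).

def run(learning_rate, steps=20, theta=1.0):
    for _ in range(steps):
        theta -= learning_rate * 2.0 * theta
    return theta

too_small = run(0.01)  # barely moves toward 0 in 20 steps
moderate  = run(0.3)   # converges rapidly toward 0
too_large = run(1.1)   # overshoots on every step and diverges

print(too_small, moderate, too_large)
```

With rate 0.01 each step multiplies θ by 0.98, so after 20 steps it has barely moved; with 0.3 the factor is 0.4 and θ is nearly zero; with 1.1 the factor is −1.2, so θ flips sign and grows without bound.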
In essence, gradient descent provides an automated way to "walk down the hill" of the cost function. By repeatedly calculating the local slope (gradient) and taking a step in the downward direction, it navigates the parameter space to find values θ that minimize the cost J(θ), thereby optimizing the machine learning model.
© 2025 ApX Machine Learning