Okay, let's break down the mechanics of the gradient descent algorithm. Having established the intuition of descending a cost function landscape to find a minimum, we now formalize the steps involved in this iterative optimization process.
The core idea is to repeatedly adjust the model parameters, typically denoted by the vector θ, in a direction that decreases the cost function J(θ). Since the gradient ∇J(θ) points in the direction of the steepest ascent, we move in the opposite direction.
The Update Rule
Gradient descent updates each parameter θj (where j indexes the different parameters in our model, like weights and biases) using the following rule in each iteration:
θj := θj − α · ∂J(θ)/∂θj
Let's dissect this crucial update step:
- θj: This is the current value of the j-th parameter we want to optimize. Our goal is to find the values for all θj that minimize the overall cost J(θ).
- :=: This denotes assignment. The value of θj is updated with the result calculated on the right-hand side.
- α: This is the learning rate, a hyperparameter that controls the size of the step we take during each iteration. A smaller α means smaller steps, potentially leading to slower convergence but more stability. A larger α can speed up convergence but risks overshooting the minimum or even diverging. We'll examine the learning rate in more detail in the next section.
- ∂J(θ)/∂θj: This is the partial derivative of the cost function J(θ) with respect to the parameter θj. It represents the slope of the cost function along the axis of θj at the current point defined by all parameters θ. This term tells us how much the cost function J changes if we slightly change θj, holding all other parameters constant. It's one component of the gradient vector ∇J(θ). The sign of this derivative indicates whether increasing θj increases or decreases the cost. By subtracting α times this derivative, we are nudged in the direction that decreases the cost.
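To make a single update step concrete, here is a minimal numeric sketch; the cost function and values are illustrative, not from the chapter. For J(θ) = θ², the derivative is 2θ, so one update moves θ toward the minimum at 0:

```python
# One gradient descent step on J(theta) = theta**2, whose derivative is 2*theta.
def cost_derivative(theta):
    return 2.0 * theta

theta = 3.0   # current parameter value (illustrative)
alpha = 0.1   # learning rate (illustrative)

step = alpha * cost_derivative(theta)  # 0.1 * 6.0 = 0.6
theta = theta - step                   # 3.0 - 0.6 = 2.4

print(theta)  # 2.4, closer to the minimum at 0
```

Note that the derivative at θ = 3 is positive (the cost rises as θ increases), so subtracting α times the derivative correctly decreases θ.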
The Iterative Process
Gradient descent is an iterative algorithm. It doesn't typically find the optimal parameters in a single step. Instead, it follows these general stages:
- Initialization: Start with initial guesses for the parameters θ. These could be zeros, small random values, or values from a pre-trained model.
- Iteration: Repeat the following until a stopping criterion (convergence) is met:
- Compute the Gradient: Calculate the partial derivative term ∂J(θ)/∂θj for all parameters j = 0, 1, ..., n, based on the current parameter values θ. The exact data used for this calculation depends on the specific variant of gradient descent (Batch, Stochastic, or Mini-batch), which we'll discuss later in this chapter.
- Update Parameters: Apply the update rule simultaneously for all parameters:
θj := θj − α · ∂J(θ)/∂θj    for j = 0, 1, ..., n
- Convergence: Stop when the parameters are no longer changing significantly, the cost function J(θ) decrease becomes very small, the gradient magnitude is close to zero, or a maximum number of iterations is reached.
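The stages above can be sketched as a small self-contained loop. The cost J(θ) = (θ − 5)², its gradient 2(θ − 5), and the hyperparameter values are illustrative choices; the stopping test here uses the gradient-magnitude criterion:

```python
# Minimal sketch of the full iterative process on a one-parameter cost
# J(theta) = (theta - 5)**2, whose gradient is 2*(theta - 5).
def gradient(theta):
    return 2.0 * (theta - 5.0)

theta = 0.0        # initialization
alpha = 0.1        # learning rate
max_iters = 1000   # safety cap on the number of iterations
tolerance = 1e-6   # stop when the gradient magnitude is this small

for i in range(max_iters):
    g = gradient(theta)
    if abs(g) < tolerance:     # convergence: gradient close to zero
        break
    theta = theta - alpha * g  # the update rule

print(theta)  # converges to approximately 5.0, the minimizer
```

Each iteration multiplies the distance to the minimum by (1 − 2α) = 0.8, so the loop terminates well before the iteration cap.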
Simultaneous Updates: A Point of Attention
It's important that the updates for all parameters θj within a single iteration happen simultaneously. This means you should first compute the partial derivatives for all j using the current parameter values from the start of the iteration. Only after all these gradient components are calculated should you update the values of θ0,θ1,...,θn.
Consider a model with two parameters, θ0 and θ1. The correct update procedure for one iteration looks like this:
# Calculate gradient components using current theta0, theta1
gradient_component_0 = compute_partial_derivative(J, theta0, theta1, wrt=theta0)
gradient_component_1 = compute_partial_derivative(J, theta0, theta1, wrt=theta1)
# Calculate the new values using the computed gradients
temp0 = theta0 - alpha * gradient_component_0
temp1 = theta1 - alpha * gradient_component_1
# Update the parameters simultaneously
theta0 = temp0
theta1 = temp1
Updating θ0 first and then immediately using this new θ0 to calculate the gradient component for θ1 in the same iteration is incorrect for standard gradient descent. Store the updates temporarily or use vector operations to ensure simultaneous application.
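With vector operations, simultaneity comes for free: the whole gradient is computed from the current parameter vector before any component is written back. A minimal sketch using NumPy, with an illustrative cost J(θ) = θ0² + 2θ1²:

```python
import numpy as np

# Gradient of the illustrative cost J(theta) = theta[0]**2 + 2*theta[1]**2.
def gradient(theta):
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

theta = np.array([3.0, 2.0])  # current parameters (illustrative)
alpha = 0.1                   # learning rate

# Both components of gradient(theta) are evaluated at the old theta,
# and the single vector assignment updates both parameters at once.
theta = theta - alpha * gradient(theta)

print(theta)  # [2.4, 1.2]
```

There is no opportunity here to accidentally use an updated θ0 when computing the gradient component for θ1, which is why vectorized implementations are the standard practice.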
Here is a concise pseudocode representation of the core algorithm:
Initialize parameters theta (e.g., randomly or with zeros)
Repeat until convergence {
    Compute gradient_components[j] = partial_derivative(J, theta, wrt=theta[j]) for all j
    For j = 0 to n {
        theta[j] = theta[j] - alpha * gradient_components[j]
    }
}
Return theta
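One way this pseudocode might be realized in runnable Python is batch gradient descent on a mean-squared-error cost. The function, data, and hyperparameter values below are illustrative assumptions, not the chapter's reference implementation:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, max_iters=10000, tol=1e-8):
    """Batch gradient descent on J(theta) = (1/(2m)) * sum((X @ theta - y)**2)."""
    m, n = X.shape
    theta = np.zeros(n)                      # initialization: zeros
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y) / m     # gradient of J at the current theta
        new_theta = theta - alpha * grad     # simultaneous update of all theta[j]
        if np.linalg.norm(new_theta - theta) < tol:  # convergence: parameters stable
            return new_theta
        theta = new_theta
    return theta

# Illustrative usage: recover y = 2*x0 + 3*x1 from noiseless data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
theta = gradient_descent(X, y)
print(theta)  # close to [2., 3.]
```

Because this variant uses all m training examples in every gradient computation, it is the "Batch" flavor mentioned above; the stochastic and mini-batch variants change only which rows of X and y enter the gradient step.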
These steps form the foundation of how gradient descent navigates the cost surface. The effectiveness and behavior of this process, however, heavily depend on factors like the learning rate α and how we compute the gradient, which we will explore next.