We have established that loss functions like MSE or Cross-Entropy give us a measure of how wrong our neural network's predictions are. The goal of training is to make this error as small as possible. But how do we actually adjust the network's weights and biases to reduce the loss? This is where the optimization algorithm comes in, and the most fundamental one is Gradient Descent.
Imagine you are standing on a foggy mountain range, and your goal is to reach the lowest valley (the minimum loss). You can only see the ground right beneath your feet. How would you proceed? A sensible strategy would be to feel the slope of the ground where you are and take a step in the steepest downhill direction. You repeat this process, taking step after step downhill, hoping to eventually reach the bottom.
Gradient Descent works in a very similar way. The "mountain range" is the loss function's surface, defined by the network's parameters (weights w and biases b, collectively denoted as θ). The "altitude" at any point is the value of the loss function for a given set of parameters θ. Our goal is to find the parameters θ that correspond to the lowest point on this surface, the minimum loss J(θ).
Calculus gives us a tool to find the "steepest direction": the gradient. The gradient of the loss function with respect to the parameters, denoted as ∇J(θ), is a vector that points in the direction of the steepest ascent on the loss surface. Since we want to go downhill to minimize the loss, we take steps in the direction opposite to the gradient (the negative gradient, −∇J(θ)).
The core idea of Gradient Descent is to iteratively update the parameters by taking small steps in the negative gradient direction. The update rule for a single parameter (or the entire set of parameters θ) looks like this:
θ_new = θ_old − α∇J(θ_old)

Let's break this down:

- θ_old: the current values of the parameters (all the weights and biases).
- α: the learning rate, a small positive scalar that controls how large a step we take.
- ∇J(θ_old): the gradient of the loss function evaluated at the current parameters, pointing in the direction of steepest ascent.
- The minus sign: we step against the gradient, i.e., downhill.
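In code, this update is a single vector operation. Here is a minimal sketch using NumPy, where `theta`, `grad`, and `alpha` are hypothetical placeholders for the parameter vector, the gradient ∇J(θ_old), and the learning rate:

```python
import numpy as np

theta = np.array([0.5, -1.2])  # current parameters θ_old (hypothetical values)
grad = np.array([0.8, -0.3])   # gradient of the loss at theta, i.e. ∇J(θ_old)
alpha = 0.1                    # learning rate α

# One gradient descent step: move against the gradient.
theta = theta - alpha * grad
```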
So, the process involves:

1. Initialize the parameters θ (typically with small random values).
2. Compute the gradient of the loss with respect to the parameters, ∇J(θ).
3. Update the parameters: θ ← θ − α∇J(θ).
4. Repeat steps 2 and 3 until the loss stops decreasing or a fixed number of iterations is reached.

The worked example after the next figure shows these steps in code.
Let's visualize this on a simple 1D loss curve, like J(w) = w². The minimum is clearly at w = 0, and the gradient is dJ/dw = 2w.
A simple quadratic loss function J(w)=w2. Starting at w=4, Gradient Descent iteratively takes steps proportional to the negative gradient (−2w) multiplied by the learning rate (α=0.2) towards the minimum at w=0.
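The trajectory in the figure is easy to reproduce. Here is a minimal sketch of gradient descent on J(w) = w², starting at w = 4 with α = 0.2, where the gradient dJ/dw = 2w is coded directly:

```python
def loss(w):
    return w ** 2

def grad(w):
    # Analytical derivative of J(w) = w^2
    return 2 * w

w = 4.0      # starting point
alpha = 0.2  # learning rate

for step in range(10):
    # One gradient descent update: w <- w - alpha * dJ/dw
    w = w - alpha * grad(w)
    print(f"step {step + 1}: w = {w:.4f}, J(w) = {loss(w):.4f}")
```

Each update here computes w ← w − 0.2·(2w) = 0.6w, so the iterate shrinks geometrically toward the minimum: 4 → 2.4 → 1.44 → 0.864 → ...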
In practice, for neural networks, the loss surface is much more complex and high-dimensional (one dimension for each weight and bias). Calculating the gradient ∇J(θ) involves computing the partial derivative of the loss with respect to every single weight and bias in the network. While conceptually simple, the actual computation requires the chain rule and the backpropagation algorithm.
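Backpropagation is covered later; for intuition, the gradient can also be approximated numerically by nudging each parameter and observing the change in loss. The following is a conceptual sketch (far too slow for real networks, but it makes "one partial derivative per parameter" concrete); `loss_fn` is a hypothetical function mapping a 1-D parameter vector to a scalar loss:

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Approximate ∇J(θ) with central finite differences:
    one pair of loss evaluations per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bumped_up = theta.copy()
        bumped_down = theta.copy()
        bumped_up[i] += eps
        bumped_down[i] -= eps
        grad[i] = (loss_fn(bumped_up) - loss_fn(bumped_down)) / (2 * eps)
    return grad
```

For a network with millions of parameters this would require millions of loss evaluations per step, which is precisely why backpropagation, which obtains all the partial derivatives in a single backward pass, is used in practice.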
A potential issue with the basic version described here (often called Batch Gradient Descent) is that calculating ∇J(θ) requires evaluating the loss and its gradient over the entire training dataset in each step. For large datasets, this can be extremely slow and computationally expensive. This motivates variations like Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, which we will discuss shortly.
For now, the key takeaway is that Gradient Descent provides the mechanism for learning. By repeatedly calculating the direction of steepest descent (the negative gradient) on the loss surface and taking steps in that direction, it iteratively adjusts the network's parameters to minimize the prediction error.