We've established how to measure our network's error using a loss function L. A high loss value means poor predictions, while a low value signifies the network is doing well. The objective of training is to find the set of weights W and biases b that minimize this loss function. But how do we systematically find these optimal parameters?
Think of the loss function as defining a landscape, potentially with hills and valleys, where the height at any point corresponds to the loss value for a specific configuration of the network's weights and biases. Our goal is to find the lowest point in this landscape, the bottom of a valley. Imagine standing somewhere on a hill in this landscape, perhaps in thick fog where you can only sense the terrain immediately around you. To get to the lowest point, a sensible strategy is to feel the ground to determine which direction slopes downward most steeply, and then take a step in that direction. By repeating this process, you should gradually make your way down towards the valley floor.
This is precisely the intuition behind gradient descent. In mathematical terms, the "slope" or "steepness" in every possible direction at our current location on the loss surface is captured by the gradient. The gradient of the loss function with respect to the parameters, often denoted as ∇L, is a vector. Each component of this vector represents the partial derivative of the loss function with respect to one specific parameter (a weight or a bias). Importantly, the gradient vector points in the direction where the loss function increases most rapidly. Since our goal is to decrease the loss, we need to move in the direction opposite to the gradient.
The loss L depends on all the adjustable parameters in the network. Therefore, the gradient ∇L tells us how sensitive the loss is to a small change in each parameter. For instance, the partial derivative ∂L/∂w_ij quantifies how much the loss L would change if we made an infinitesimal adjustment to the specific weight w_ij connecting neuron i to neuron j.
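To make this concrete, a partial derivative can be estimated numerically by nudging a single weight and observing how the loss responds. The following is a minimal sketch using a made-up two-weight loss function; the function, its quadratic form, and the names are illustrative assumptions, not anything defined in this chapter.

```python
# Numerically estimate a partial derivative with a central finite difference.
# The loss here is a hypothetical stand-in: L(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2.
def loss(w1, w2):
    return (w1 - 1.0) ** 2 + (w2 + 2.0) ** 2

def partial_L_wrt_w1(w1, w2, eps=1e-6):
    # Nudge only w1; every other parameter stays fixed.
    return (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)

print(partial_L_wrt_w1(3.0, 0.0))  # ~4.0: the loss rises about 4 units per unit increase in w1
```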
Gradient descent uses this gradient information to iteratively update the network's parameters. Each iteration of the training process involves two steps:

1. Compute the gradient ∇L of the loss with respect to every weight and bias at the current parameter values.
2. Take a small step: adjust each parameter in the direction opposite to its gradient component.
The size of this "step" is determined by a hyperparameter called the learning rate (often denoted by η or α), which scales the gradient. We'll discuss the learning rate in more detail shortly, but conceptually, the update rule for a single parameter, say a weight w, looks like this:
w_new = w_old − η · ∂L/∂w_old

This formula states that the new value of the weight is the old value minus a small fraction (determined by η) of the gradient component for that weight. We perform this update simultaneously for all weights and biases in the network.
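In code, this update is a single line per parameter. A minimal sketch, assuming the gradient component grad_w for a weight w has already been computed:

```python
def descent_step(w, grad_w, eta=0.1):
    # New value = old value minus (learning rate times gradient component).
    return w - eta * grad_w

# If the loss slopes upward at the current weight (positive gradient),
# the update moves the weight downward, and vice versa.
w = descent_step(w=0.5, grad_w=2.0)   # 0.5 - 0.1 * 2.0 = 0.3
```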
Let's visualize this process. Consider a simplified case where the loss L depends on only one parameter w. The loss function might be represented by a curve. Gradient descent starts at some initial guess w0 and takes successive steps down the slope of the curve until it approaches the minimum point.
Starting from an initial parameter value (e.g., w=4), gradient descent iteratively calculates the slope (gradient) at the current point and takes a step downhill (in the direction opposite to the gradient) towards the minimum loss. The learning rate controls the size of these steps.
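We can reproduce this one-dimensional picture directly. The sketch below assumes a simple quadratic loss L(w) = (w − 2)², whose minimum sits at w = 2, and starts from w = 4 as in the description above; both the loss function and the learning rate are illustrative choices.

```python
# Gradient descent on an assumed 1D loss L(w) = (w - 2)^2, minimum at w = 2.
def loss(w):
    return (w - 2.0) ** 2

def grad(w):
    return 2.0 * (w - 2.0)   # analytical derivative dL/dw

w = 4.0      # initial guess, as in the figure description
eta = 0.1    # learning rate

for step in range(20):
    w -= eta * grad(w)       # step opposite to the slope
    print(f"step {step:2d}: w = {w:.4f}, loss = {loss(w):.6f}")
# w approaches 2.0; the steps shrink as the slope flattens near the minimum.
```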
In practical neural networks, the loss landscape is not a simple 1D curve or even a 2D surface. It's a high-dimensional space defined by potentially thousands or millions of parameters. Visualizing this is impossible, but the mathematical principle of gradient descent remains the same: calculate the gradient vector ∇L (which contains the partial derivative for every single weight and bias) and update all parameters simultaneously by moving them in the direction opposite to their gradient component, scaled by the learning rate.
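The rule itself does not change in high dimensions; it is simply applied to every parameter array at once. A minimal NumPy sketch, where a small weight matrix and bias vector stand in for a network's full parameter set and the gradient values are placeholders (in practice they come from the gradient calculation we turn to next):

```python
import numpy as np

def gradient_descent_update(params, grads, eta):
    # Update every parameter array simultaneously: p <- p - eta * grad.
    return [p - eta * g for p, g in zip(params, grads)]

# Illustrative shapes for a tiny layer: a 3x2 weight matrix and a bias vector.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
b = np.zeros(2)
grad_W = np.ones((3, 2))  # placeholder gradient components
grad_b = np.ones(2)

W, b = gradient_descent_update([W, b], [grad_W, grad_b], eta=0.01)
```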
This process, calculating the gradient and updating the parameters, is repeated many times (an iteration that processes the entire training dataset once is often called an epoch). With each iteration, the parameters should shift incrementally toward values that yield a lower loss. We typically continue until the loss converges, meaning it stops decreasing significantly, which suggests we have reached a minimum point in the loss landscape.
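One common stopping rule checks exactly this: stop once an iteration fails to change the loss by more than a small tolerance. A sketch of such a loop, with a hypothetical compute_loss_and_grads callable standing in for a full forward and backward pass:

```python
def train(params, compute_loss_and_grads, eta=0.01, tol=1e-6, max_iters=10_000):
    # Repeat gradient steps until the loss plateaus (or a safety cap is hit).
    prev_loss = float("inf")
    for _ in range(max_iters):
        loss_value, grads = compute_loss_and_grads(params)
        if abs(prev_loss - loss_value) < tol:
            break   # converged: the loss is no longer changing significantly
        params = [p - eta * g for p, g in zip(params, grads)]
        prev_loss = loss_value
    return params
```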
While gradient descent provides a powerful method for optimizing network parameters, it's not without challenges. The complex, high-dimensional loss surfaces of neural networks are often non-convex, meaning they can contain many local minima (points that are lower than their immediate surroundings but not the lowest overall point) in addition to the desired global minimum. The algorithm might converge to a local minimum, which might not represent the best possible solution. Furthermore, the choice of learning rate and the specific variant of gradient descent used can significantly impact training speed and stability. These are considerations we'll explore as we proceed. For now, the fundamental idea is clear: follow the negative gradient downhill to minimize the loss. The next step is understanding how to calculate that gradient efficiently.
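As a closing illustration of the local-minimum issue, consider an assumed non-convex one-dimensional loss with two valleys. Depending on where descent starts, it settles into a different minimum; the polynomial below is purely illustrative, not a real network's loss.

```python
# An assumed non-convex 1D loss with two minima: L(w) = w^4 - 3w^2 + w.
def loss(w):
    return w**4 - 3 * w**2 + w

def grad(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, steps=500):
    for _ in range(steps):
        w -= eta * grad(w)
    return w

print(descend(2.0))    # settles near w ≈  1.14, a shallower local minimum
print(descend(-2.0))   # settles near w ≈ -1.30, the global minimum here
```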