After backpropagation has determined the gradient of the loss function with respect to each weight and bias in the network (effectively calculating ∂Loss/∂W and ∂Loss/∂b for all parameters), the next step is to use this information to improve the network. The gradients tell us the direction of the steepest increase in the loss. Since our goal is to minimize the loss, we need to adjust the parameters in the opposite direction of their respective gradients.
This adjustment process is the core of gradient descent. For each weight (W) and bias (b) in the network, we update its value based on its current value, the calculated gradient, and a parameter called the learning rate (η).
The update rules are straightforward:
For a specific weight Wij (connecting neuron i in one layer to neuron j in the next):

Wij,new = Wij,old − η · ∂Loss/∂Wij,old

Similarly, for a specific bias bj (of neuron j):

bj,new = bj,old − η · ∂Loss/∂bj,old
Let's break down these equations:
The subtraction sign is fundamental here. If the gradient ∂Loss/∂Wij,old is positive (meaning increasing Wij increases the loss), the update rule subtracts a positive value (η × gradient), thereby decreasing Wij. Conversely, if the gradient is negative (meaning increasing Wij decreases the loss), the update rule subtracts a negative value, which results in increasing Wij. In both cases, the adjustment pushes the parameter value in a direction expected to lower the overall loss.
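This sign behavior can be seen in a few lines of code. The function name `update` and the values below are illustrative, not part of any framework:

```python
def update(param, grad, lr=0.1):
    """One gradient descent step: move opposite to the gradient."""
    return param - lr * grad

# Positive gradient: increasing the weight would increase the loss,
# so the update decreases the weight (0.5 -> 0.3).
print(update(0.5, 2.0))

# Negative gradient: increasing the weight would decrease the loss,
# so the update increases the weight (0.5 -> 0.7).
print(update(0.5, -2.0))
```

Either way, the parameter moves toward lower loss; only the learning rate η controls how large that step is.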
In practice, performing these updates individually for every single parameter would be computationally inefficient. Deep learning frameworks such as TensorFlow and PyTorch, like the numerical library NumPy they resemble, perform these updates using matrix and vector operations, applying the same logic simultaneously to all weights and biases within a layer or even the entire network.
If we represent all weights in a layer as a matrix W and all biases as a vector b, and their corresponding gradients as ∇W Loss and ∇b Loss, the update rules can be expressed concisely as:

Wnew = Wold − η ∇W Loss

bnew = bold − η ∇b Loss

Here, ∇W Loss is a matrix of the same dimensions as W (evaluated at Wold), where each element is the partial derivative of the loss with respect to the corresponding weight. Similarly, ∇b Loss is a vector containing the partial derivatives with respect to each bias term. The subtraction and the scalar multiplication by η are performed element-wise.
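A sketch of this vectorized update in NumPy. The layer shape (3 inputs, 2 outputs) and the random gradient values are placeholders; in a real network the gradients would come from backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))       # weight matrix of one layer
b = np.zeros(2)                       # bias vector of that layer
grad_W = rng.standard_normal((3, 2))  # dLoss/dW, same shape as W
grad_b = rng.standard_normal(2)       # dLoss/db, same shape as b

eta = 0.01  # learning rate

# One expression updates every weight in the layer simultaneously;
# subtraction and scaling by eta are applied element-wise.
W_new = W - eta * grad_W
b_new = b - eta * grad_b
```

The same two lines apply unchanged whatever the layer's dimensions, which is why frameworks express the update this way rather than looping over parameters.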
This update step is typically performed once per batch of training data (in Mini-batch Gradient Descent) or after processing the entire dataset (in Batch Gradient Descent). This iterative process of calculating predictions (forward propagation), measuring error (loss function), computing gradients (backpropagation), and adjusting parameters (weight/bias updates) constitutes the core training loop for neural networks, gradually guiding the network towards a configuration that minimizes the prediction error on the training data.
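The full loop described above can be sketched end to end for the simplest possible "network", a single linear unit trained with batch gradient descent on a toy regression task. The data, model, and hyperparameters here are illustrative only:

```python
import numpy as np

# Toy dataset: targets are a known linear function of the inputs.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 3.0

w = np.zeros(3)   # weights, initialized to zero
b = 0.0           # bias
eta = 0.1         # learning rate

for epoch in range(200):
    # 1. Forward propagation: compute predictions.
    y_pred = X @ w + b
    # 2. Measure error with a loss function (mean squared error).
    err = y_pred - y
    loss = np.mean(err ** 2)
    # 3. Compute gradients (for MSE these have a closed form;
    #    a real network would use backpropagation here).
    grad_w = 2 * X.T @ err / len(X)
    grad_b = 2 * np.mean(err)
    # 4. Update parameters opposite to the gradients.
    w -= eta * grad_w
    b -= eta * grad_b

# After repeated iterations, w approaches true_w and b approaches 3.0.
```

Each pass through the loop performs exactly the four stages named in the text: forward propagation, loss measurement, gradient computation, and the parameter update.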
© 2025 ApX Machine Learning