As we established, training a neural network involves adjusting its weights and biases to minimize a loss function, typically using gradient descent. The core requirement for gradient descent is the gradient of the loss function with respect to every parameter (each weight w_ij(l) and bias b_i(l)) in the network. Calculating these gradients naively for a deep network with potentially millions of parameters would be computationally prohibitive.
This is where the backpropagation algorithm comes in. It's not a new optimization algorithm itself, but rather a remarkably efficient method for computing the necessary gradients. Backpropagation elegantly applies the chain rule of calculus, working backward from the final loss calculation through the network layers to determine how each parameter contributes to that loss.
Think of it like assigning "blame" or "credit" for the final error. If the network's output is far from the target, backpropagation helps determine which connections (weights) and thresholds (biases) were most responsible for the discrepancy, and in which direction they should be adjusted.
The algorithm operates in two main passes:
Forward Pass: This is the standard process of feeding an input x through the network, layer by layer, calculating the weighted sums (z(l)) and activations (a(l)) at each layer until we reach the final output activation a(L) and compute the loss L. During this pass, it's essential to store the intermediate values, specifically the weighted sums z(l) and activations a(l) for each layer l, as they are needed for the backward pass.
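As a concrete illustration, here is a minimal NumPy sketch of a forward pass that caches z(l) and a(l) at every layer. The sigmoid activation and the names (forward_pass, weights, biases, zs, activations) are illustrative choices for this sketch, not fixed by the text; it operates on a single example stored as a column vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Run x through the network, caching z(l) and a(l) for the backward pass.

    weights[l] has shape (n_l, n_{l-1}); biases[l] has shape (n_l, 1).
    """
    activations = [x]   # a(0) is the input itself
    zs = []             # weighted sums z(l), one per layer
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # z(l) = W(l) a(l-1) + b(l)
        zs.append(z)
        a = sigmoid(z)  # a(l) = g(z(l))
        activations.append(a)
    return zs, activations
```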
Backward Pass: This is the core of backpropagation. It starts at the output layer and propagates the gradient of the loss backward through the network.
Output Layer (Layer L): We first compute the gradient of the loss with respect to the output activations, ∂L/∂a(L). The exact form depends on the loss function used. Then, using the chain rule and the stored z(L) value, we compute the gradient of the loss with respect to the pre-activation value z(L) for the output layer. This term is often denoted as the "error" δ(L) for the output layer:
δ(L) = ∂L/∂z(L) = ∂L/∂a(L) ⊙ g′(z(L))

Here, g′ is the derivative of the activation function used in the output layer, and ⊙ represents element-wise multiplication (Hadamard product). The element-wise product is used because both ∂L/∂a(L) and g′(z(L)) are vectors (or tensors) of the same dimension as the output layer.
Gradients for Output Layer Parameters: Once we have δ(L), we can compute the gradients for the weights W(L) and biases b(L) connecting to the output layer:
∂L/∂W(L) = δ(L) (a(L−1))^T
∂L/∂b(L) = δ(L)

(Note: ∂L/∂W(L) is an outer product if δ(L) and a(L−1) are column vectors. Summation over the batch dimension is implicit if using mini-batches.)
Propagating the Error Backward: The crucial step is to calculate how the error δ(L) propagates back to the previous layer's activations, a(L−1). We need ∂L/∂a(L−1) to continue the process. This is found by considering how a(L−1) influences z(L) (via W(L)) and then how z(L) influences L (via δ(L)):
∂L/∂a(L−1) = (W(L))^T δ(L)

Hidden Layers (Layer l): Now we generalize the process for any hidden layer l (moving backward from L−1 down to 1). We assume we have already computed ∂L/∂a(l) (which is the error propagated from layer l+1). We compute the error δ(l) for the current layer:
δ(l) = ∂L/∂z(l) = ∂L/∂a(l) ⊙ g′(z(l)) = ((W(l+1))^T δ(l+1)) ⊙ g′(z(l))

This equation shows how the error from the next layer (δ(l+1)), weighted by the connections (W(l+1)), is combined with the local gradient of the activation function (g′(z(l))) to form the error for the current layer.
Gradients for Hidden Layer Parameters: Using δ(l), we compute the gradients for the parameters of layer l:
∂L/∂W(l) = δ(l) (a(l−1))^T
∂L/∂b(l) = δ(l)

Continue Backward: We repeat this process, calculating δ(l), ∂L/∂W(l), and ∂L/∂b(l) for each layer l from L−1 down to 1. Each step uses the error δ(l+1) computed in the previous step (the step for layer l+1).
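Putting these steps together, here is a minimal sketch of the backward pass. It assumes the forward-pass cache from the earlier snippet, sigmoid activations in every layer, a single training example, and a mean-squared-error loss (so that ∂L/∂a(L) = a(L) − y); these are illustrative choices, not requirements of the algorithm.

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # g'(z) for the sigmoid activation

def backward_pass(y, zs, activations, weights):
    """Compute dL/dW(l) and dL/db(l) for every layer via backpropagation."""
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)

    # Output layer: delta(L) = dL/da(L) ⊙ g'(z(L)); here dL/da(L) = a(L) - y (MSE).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W[-1] = delta @ activations[-2].T   # dL/dW(L) = delta(L) (a(L-1))^T
    grads_b[-1] = delta                       # dL/db(L) = delta(L)

    # Hidden layers: walk backward, reusing the delta from the layer above.
    for l in range(len(weights) - 2, -1, -1):
        # delta(l) = ((W(l+1))^T delta(l+1)) ⊙ g'(z(l))
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = delta @ activations[l].T  # activations[l] holds a(l-1)
        grads_b[l] = delta
    return grads_W, grads_b
```

Notice that delta is overwritten in place each iteration: the error for one layer is all that is needed to compute both that layer's parameter gradients and the next delta, which is exactly the reuse that makes backpropagation efficient.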
The diagram below illustrates the backward flow of gradient information. The forward pass computes values from left to right. The backward pass (dashed red lines) computes gradients starting from the loss L and moves right to left, calculating the required partial derivatives (∂L/∂…) at each stage.
Flow of backpropagation. Solid lines represent the forward pass computations. Dashed lines represent the backward flow of gradient calculations, starting from the Loss (L) and moving leftward through the layers. Each dashed edge label indicates the gradient being computed at that step.
The efficiency of backpropagation comes from reusing calculations. Notice how the error δ(l) calculated for layer l is directly used to compute both the gradients ∂L/∂W(l), ∂L/∂b(l) and the error term δ(l−1) needed for the previous layer (via (W(l))^T δ(l)). This avoids redundant computations that would occur if we tried to calculate each ∂L/∂w_ij and ∂L/∂b_i independently from scratch using the chain rule all the way back to the input.
In essence, backpropagation is a practical application of reverse-mode automatic differentiation applied to neural networks. It provides a structured way to compute the gradient of a scalar objective function (the loss) with respect to all its input parameters (the weights and biases) in a time complexity roughly proportional to the cost of the forward pass itself.
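One way to see this correspondence, and to sanity-check an implementation like the sketches above, is to compare a backpropagated gradient against a central-difference estimate of the same partial derivative. The network sizes, seed, and helper name numerical_grad below are arbitrary illustrative choices.

```python
def numerical_grad(loss_fn, W, i, j, eps=1e-6):
    """Central-difference estimate of dL/dW[i, j]."""
    W[i, j] += eps
    loss_plus = loss_fn()
    W[i, j] -= 2 * eps
    loss_minus = loss_fn()
    W[i, j] += eps  # restore the original value
    return (loss_plus - loss_minus) / (2 * eps)

# Tiny random network: 3 inputs -> 4 hidden -> 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))

def loss_fn():
    _, acts = forward_pass(x, weights, biases)
    return 0.5 * float(np.sum((acts[-1] - y) ** 2))  # MSE, matching dL/da = a - y

zs, acts = forward_pass(x, weights, biases)
grads_W, _ = backward_pass(y, zs, acts, weights)
print(grads_W[0][1, 2], numerical_grad(loss_fn, weights[0], 1, 2))  # should agree closely
```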
Once the backward pass is complete, we have the gradients ∂L/∂W(l) and ∂L/∂b(l) for all layers l. These gradients are then used by an optimization algorithm (like SGD, Adam, etc.) to update the network's parameters:
W(l) ← W(l) − η ∂L/∂W(l)
b(l) ← b(l) − η ∂L/∂b(l)
where η is the learning rate. This completes one cycle of forward pass, loss calculation, backward pass (backpropagation), and parameter update, forming the basis of how neural networks learn.
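Completing the cycle with the sketches above, one plain gradient-descent step might look like the following; the learning rate value here is an arbitrary illustrative choice.

```python
eta = 0.1  # learning rate

# One full training step: forward pass, backward pass, then update every layer.
zs, acts = forward_pass(x, weights, biases)
grads_W, grads_b = backward_pass(y, zs, acts, weights)
for l in range(len(weights)):
    weights[l] -= eta * grads_W[l]  # W(l) <- W(l) - eta * dL/dW(l)
    biases[l]  -= eta * grads_b[l]  # b(l) <- b(l) - eta * dL/db(l)
```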