We know we need to calculate the gradient of the loss function with respect to each weight and bias in the network. This gradient tells us the direction of steepest ascent of the loss function; we want to move in the opposite direction (gradient descent) to minimize the error. The central challenge is computing these gradients efficiently, especially for networks with many layers and parameters. Trying to estimate the impact of each individual parameter on the final loss by slightly perturbing it (numerical differentiation) is computationally infeasible for practical networks.
This is where the backpropagation algorithm comes in. It's a remarkably efficient method for computing all the necessary gradients by leveraging the chain rule from calculus. The name "backpropagation" comes from the fact that it computes the gradients starting from the final output layer and works its way backward through the network to the input layer, propagating the error gradient as it goes.
At its heart, a neural network is a series of nested functions. The output of one layer becomes the input to the next. To find how a change in a parameter deep inside the network affects the final loss, we need to apply the chain rule repeatedly.
Consider a very simple network calculation: a single input $x$ passes through two layers, each computing a pre-activation $z^{(l)}$ and an activation $a^{(l)}$:

$$z^{(1)} = w^{(1)}x + b^{(1)}, \quad a^{(1)} = \sigma(z^{(1)}), \quad z^{(2)} = w^{(2)}a^{(1)} + b^{(2)}, \quad a^{(2)} = \sigma(z^{(2)})$$

with the loss $L$ computed from the output $a^{(2)}$ and the target $y$.
If we want to find the gradient of the loss L with respect to a weight in the first layer, say w(1), we need to trace its influence forward to the loss: w(1) affects z(1), which affects a(1), which affects z(2), which affects a(2), which finally affects L. The chain rule allows us to break this down:
$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

Backpropagation organizes this calculation efficiently. It calculates terms like $\frac{\partial L}{\partial a^{(2)}}$ and $\frac{\partial L}{\partial z^{(2)}}$ first, and then reuses them to compute gradients for earlier layers.
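The chain-rule product above can be checked directly for the two-layer scalar network. The sketch below (with made-up values for $w^{(1)}$, $w^{(2)}$, $x$, $y$, tanh activations, and a squared-error loss, none of which are prescribed by the text) multiplies the five local derivatives together and compares the result against a numerical estimate:

```python
import math

# Illustrative two-layer scalar network (biases omitted for brevity).
def forward(w1, w2, x, y):
    z1 = w1 * x
    a1 = math.tanh(z1)           # hidden activation
    z2 = w2 * a1
    a2 = math.tanh(z2)           # output activation
    L = 0.5 * (a2 - y) ** 2      # squared-error loss (an assumption for this sketch)
    return z1, a1, z2, a2, L

w1, w2, x, y = 0.5, -0.3, 1.2, 0.8
z1, a1, z2, a2, L = forward(w1, w2, x, y)

# Chain rule: dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
dL_da2 = a2 - y
da2_dz2 = 1 - math.tanh(z2) ** 2   # tanh'(z) = 1 - tanh(z)^2
dz2_da1 = w2
da1_dz1 = 1 - math.tanh(z1) ** 2
dz1_dw1 = x
dL_dw1 = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Sanity check against a central-difference numerical estimate.
eps = 1e-6
L_plus = forward(w1 + eps, w2, x, y)[-1]
L_minus = forward(w1 - eps, w2, x, y)[-1]
numeric = (L_plus - L_minus) / (2 * eps)
```

The two estimates agree to many decimal places, which is exactly the correctness property backpropagation relies on; it simply computes the same products far more cheaply.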
The algorithm works in two main phases:
Forward Pass: Input data is fed through the network layer by layer, calculating pre-activations (z(l)) and activations (a(l)) at each step, ultimately producing the final output prediction a(L). During this pass, we store the intermediate values (z(l) and a(l) for all layers l) as they will be needed for the backward pass. The final loss L is calculated using the prediction a(L) and the true target y.
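The forward pass with caching can be sketched as follows. The layer sizes, sigmoid activation, and random initialization are illustrative assumptions, not part of the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Feed x through the network, caching every z^(l) and a^(l)
    so the backward pass can reuse them."""
    a = x
    cache = {"a": [x], "z": []}   # a^(0) is the input itself
    for W, b in zip(weights, biases):
        z = W @ a + b             # pre-activation z^(l)
        a = sigmoid(z)            # activation a^(l)
        cache["z"].append(z)
        cache["a"].append(a)
    return a, cache

# Tiny example network: 3 inputs, one hidden layer of 4, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))
out, cache = forward_pass(x, weights, biases)
```

Note that the cache grows with network depth; storing these intermediates is the memory cost backpropagation pays for its speed.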
Backward Pass: This pass computes the gradients.
Output Layer: Calculate the gradient of the loss with respect to the output layer's activation, $\frac{\partial L}{\partial a^{(L)}}$. Then, using the chain rule, calculate the gradient with respect to the output layer's pre-activation $z^{(L)}$:

$$\delta^{(L)} = \frac{\partial L}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \sigma'(z^{(L)})$$

Here, $\sigma'(z^{(L)})$ is the derivative of the activation function used in the output layer, evaluated at $z^{(L)}$. This term $\delta^{(L)}$ represents the "error signal" at the output layer's pre-activation. Now, calculate the gradients for the output layer's weights $W^{(L)}$ and biases $b^{(L)}$:

$$\frac{\partial L}{\partial W^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial W^{(L)}} = \delta^{(L)} (a^{(L-1)})^T \qquad \frac{\partial L}{\partial b^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial b^{(L)}} = \delta^{(L)}$$

(Note: the exact forms depend on whether we are using vector/matrix notation. $a^{(L-1)}$ represents the activations from the previous layer feeding into the output layer; the transpose ensures dimensions match for matrix multiplication.)
Hidden Layers (working backward from $l = L-1$ down to $l = 1$): For each hidden layer $l$, calculate its error signal $\delta^{(l)}$ based on the error signal $\delta^{(l+1)}$ from the next layer (the one closer to the output):

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$$

Here, $(W^{(l+1)})^T$ is the transpose of the weights connecting layer $l$ to layer $l+1$. This step effectively propagates the error gradient backward through the network weights. The term $\sigma'(z^{(l)})$ is the derivative of the activation function of the current hidden layer $l$, and $\odot$ denotes element-wise multiplication (the Hadamard product). Once $\delta^{(l)}$ is known, calculate the gradients for layer $l$'s weights $W^{(l)}$ and biases $b^{(l)}$ just as for the output layer:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

where $a^{(l-1)}$ are the activations from the layer feeding into layer $l$ (for the first hidden layer, this is the input data $x$).
This process continues backward until gradients for all parameters have been computed.
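The two phases above can be sketched end to end. This is a minimal implementation under stated assumptions, a squared-error loss $L = \frac{1}{2}\lVert a^{(L)} - y \rVert^2$ and sigmoid activations in every layer, with made-up layer sizes; it is not the only possible formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_pass(x, weights, biases):
    # Cache z^(l) and a^(l) for the backward pass; a^(0) is the input x.
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward_pass(y, weights, zs, activations):
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Output layer: delta^(L) = dL/da^(L) * sigma'(z^(L)); for squared error,
    # dL/da^(L) = a^(L) - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W[-1] = delta @ activations[-2].T
    grads_b[-1] = delta
    # Hidden layers, from l = L-1 down to l = 1:
    # delta^(l) = ((W^(l+1))^T delta^(l+1)) ⊙ sigma'(z^(l))
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = delta @ activations[l].T
        grads_b[l] = delta
    return grads_W, grads_b

# Demo on a tiny 3-4-2 network with random parameters.
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))

zs, activations = forward_pass(x, weights, biases)
grads_W, grads_b = backward_pass(y, weights, zs, activations)
```

Notice how a single backward sweep produces the gradients for every layer: each `delta` is computed once and immediately reused for the layer before it.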
We can think of the computation as flowing forward to get the loss, and then backward to get the gradients.
A simplified view of forward (blue arrows) and backward (red arrows) passes. The forward pass computes activations and the loss. The backward pass computes gradients, starting from the loss and propagating error signals (such as $\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$) backward, reusing values calculated during the forward pass ($a^{(l)}$, $z^{(l)}$) and the network weights ($W^{(l)}$). Gradients for the parameters ($W$, $b$) are derived from these error signals.
The efficiency of backpropagation comes from two main aspects: it reuses the intermediate values ($z^{(l)}$, $a^{(l)}$) stored during the forward pass, and it reuses each error signal $\delta^{(l+1)}$ when computing $\delta^{(l)}$, so no partial derivative is ever recomputed.
Compared to numerically estimating gradients, which requires at least one extra forward pass for each parameter, backpropagation computes all gradients in roughly the same amount of computational time as a single forward pass. For a network with millions of parameters, this difference is enormous, making the training of deep networks feasible.
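A quick back-of-the-envelope comparison makes the gap concrete. The layer sizes below are illustrative (a typical MNIST-style 784-256-128-10 network, not one defined in the text):

```python
# Count parameters in a fully connected network: each layer has an
# m x n weight matrix plus m biases.
sizes = [784, 256, 128, 10]   # illustrative layer widths
params = sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

# Central-difference numerical gradients need 2 forward passes per parameter.
numeric_passes = 2 * params

# Backpropagation needs one forward pass plus one backward pass of
# comparable cost, regardless of the parameter count.
backprop_passes = 2

print(params, numeric_passes, backprop_passes)
```

For this modest network the numerical approach already requires hundreds of thousands of forward passes per gradient evaluation, versus a constant two passes for backpropagation.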
With the gradients computed via backpropagation, we now have the direction needed to update the weights and biases using gradient descent, which we will detail in the following sections.
© 2025 ApX Machine Learning