Okay, let's connect the chain rule directly to the mechanics of training neural networks. As we discussed, a neural network is essentially a large composite function. The final output (e.g., a classification score or a regression value) depends on the outputs of the previous layer, which depend on the outputs of the layer before that, all the way back to the input data. Crucially, the function computed at each layer involves weights and biases, the parameters we need to adjust.
Training involves minimizing a loss function, let's call it L, which measures the difference between the network's prediction and the actual target value. To minimize L using gradient descent (or its variants), we need to calculate the partial derivative of L with respect to every single weight W and bias b in the network: ∂L/∂W and ∂L/∂b. Given the nested structure, calculating these directly seems daunting. This is where the chain rule becomes indispensable.
Backpropagation is not a new optimization algorithm; it's an efficient algorithm for computing gradients in a neural network. It systematically applies the multivariable chain rule to calculate all the necessary partial derivatives (∂L/∂Wij[l] and ∂L/∂bi[l] for all layers l, neurons i, and input connections j) by working backward from the loss function.
Imagine the forward pass: input data flows through the network, layer by layer, undergoing linear transformations (multiplying by weights, adding biases) and non-linear activations, ultimately producing an output prediction and then a loss value L.
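To ground this, here is a minimal NumPy sketch of a forward pass through a two-layer network. The layer sizes, the sigmoid activation, and the squared-error loss are illustrative assumptions, not requirements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output (column vectors).
x = rng.normal(size=(3, 1))            # input, a[0]
y = np.array([[1.0]])                  # target value
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass: linear transformation, then non-linear activation, per layer.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                       # prediction, a[L]

loss = 0.5 * np.sum((a2 - y) ** 2)     # squared-error loss L
```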
Backpropagation reverses this flow:
Start at the End: The process begins by calculating the derivative of the loss L with respect to the final output activation of the network, a[L] (where L denotes the final layer). This is usually straightforward, depending on the specific loss function and final activation function used. Let's denote ∂L/∂a[L] as δa[L].
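Continuing the sketch above with its assumed squared-error loss L = ½‖a[L] − y‖², this starting derivative is just the prediction error:

```python
# For L = 0.5 * ||a2 - y||^2, the derivative of L w.r.t. a2 is (a2 - y).
delta_a2 = a2 - y
```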
Gradient w.r.t. Pre-activation (z[L]): Using the chain rule, we find the gradient with respect to the pre-activation linear output z[L] of the final layer. If a[L] = g(z[L]), where g is the activation function, then:

∂L/∂z[L] = ∂L/∂a[L] ⋅ ∂a[L]/∂z[L] = δa[L] ⊙ g′(z[L])

Let's denote ∂L/∂z[L] as δz[L]. The term g′(z[L]) is the derivative of the activation function evaluated at z[L]; for vector-valued layers, the product ⊙ is taken elementwise.
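In the running sketch, g is the sigmoid, whose derivative is g′(z) = g(z)(1 − g(z)):

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# delta_z[L] = delta_a[L] ⊙ g'(z[L]), an elementwise product.
delta_z2 = delta_a2 * sigmoid_prime(z2)
```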
Gradients w.r.t. Parameters (W[L], b[L]): Now that we have δz[L], we can find the gradients for the weights W[L] and biases b[L] of the final layer. Since z[L] = W[L]a[L−1] + b[L]:

∂L/∂W[L] = ∂L/∂z[L] ⋅ ∂z[L]/∂W[L] = δz[L] ⋅ (a[L−1])ᵀ
∂L/∂b[L] = ∂L/∂z[L] ⋅ ∂z[L]/∂b[L] = δz[L] ⋅ 1 = δz[L]

(Note: The calculation for ∂L/∂W[L] involves matrix/vector operations, resulting in a gradient matrix of the same shape as W[L]. We often sum δz[L] across the batch dimension for ∂L/∂b[L].)
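Continuing the sketch, these two gradients are one line each:

```python
dW2 = delta_z2 @ a1.T   # delta_z[L] (a[L-1])^T, same shape as W2: (1, 4)
db2 = delta_z2          # equals delta_z[L]; sum over the batch axis if batched
```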
Propagate Gradient to Previous Layer: The truly clever part is propagating the error gradient backward. We need to find ∂L/∂a[L−1], the gradient with respect to the activations of the previous layer. Again, using the chain rule through z[L]:

∂L/∂a[L−1] = ∂L/∂z[L] ⋅ ∂z[L]/∂a[L−1] = (W[L])ᵀ ⋅ δz[L]

Let's denote this δa[L−1]. This tells us how much the activations in layer L−1 contributed to the final loss L.
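In the sketch, this is a single matrix product with the transposed weights:

```python
# delta_a[L-1] = (W[L])^T delta_z[L]: push the gradient back through z2 = W2 @ a1 + b2.
delta_a1 = W2.T @ delta_z2   # shape (4, 1), matching a1
```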
Repeat for Layer L−1: Now we have δa[L−1], so we can repeat the three preceding steps for layer L−1: compute δz[L−1] = δa[L−1] ⊙ g′(z[L−1]), obtain the parameter gradients ∂L/∂W[L−1] = δz[L−1] ⋅ (a[L−2])ᵀ and ∂L/∂b[L−1] = δz[L−1], and then compute δa[L−2] = (W[L−1])ᵀ ⋅ δz[L−1] to pass back further.
Continue Until Input Layer: This process repeats, stepping backward through each layer until we reach the input layer. At each step l, we use the incoming gradient δa[l] (calculated from layer l+1) to compute δz[l], then the gradients for W[l] and b[l], and finally the gradient δa[l−1] to pass back to the previous layer.
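The loop below puts all of these steps together in one self-contained sketch; the sigmoid activations throughout and the squared-error loss are, again, illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, weights, biases):
    """Gradients of a squared-error loss w.r.t. every W[l] and b[l]
    of a sigmoid MLP (illustrative assumptions, not the only choice)."""
    # Forward pass, caching each pre-activation z[l] and activation a[l].
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Start at the end: for L = 0.5 * ||a[L] - y||^2, delta_a[L] = a[L] - y.
    delta_a = activations[-1] - y

    dWs, dbs = [], []
    for l in reversed(range(len(weights))):
        s = sigmoid(zs[l])
        delta_z = delta_a * s * (1.0 - s)          # delta_z = delta_a ⊙ g'(z)
        dWs.insert(0, delta_z @ activations[l].T)  # gradient w.r.t. W[l]
        dbs.insert(0, delta_z)                     # gradient w.r.t. b[l]
        delta_a = weights[l].T @ delta_z           # propagate to the previous layer
    return dWs, dbs
```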
This systematic backward application of the chain rule ensures that we compute the gradient of the loss L with respect to every parameter in the network efficiently, reusing calculations where possible. The gradients ∂L/∂W[l] and ∂L/∂b[l] for all layers are then used in an optimization algorithm like gradient descent to update the parameters:

W[l] := W[l] − α ⋅ ∂L/∂W[l]
b[l] := b[l] − α ⋅ ∂L/∂b[l]
where α is the learning rate.
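Using the backward function from the sketch above, one update step might look like this (the learning rate value and network shapes here are arbitrary):

```python
rng = np.random.default_rng(0)

# Hypothetical two-layer network: 3 inputs, 4 hidden units, 1 output.
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros((4, 1)), np.zeros((1, 1))]
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])

alpha = 0.1  # illustrative learning rate
dWs, dbs = backward(x, y, weights, biases)
for l in range(len(weights)):
    weights[l] -= alpha * dWs[l]
    biases[l] -= alpha * dbs[l]
```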
Figure: Simplified view of forward propagation (calculating loss) and backward propagation (calculating gradients) in a two-layer network. Backpropagation applies the chain rule layer by layer, starting from the loss, to compute the gradient of the loss with respect to each parameter (W[l], b[l]) and activation (a[l]).
Understanding this backward flow and the application of the chain rule at each step is fundamental to comprehending how neural networks learn. The next section will look at how computational graphs can help visualize this process even more clearly.