The chain rule is directly applicable to the mechanics of training neural networks. A neural network operates as a large composite function. Its final output (e.g., a classification score or a regression value) depends on the outputs of the preceding layer, which in turn depend on the outputs of the layer before that, extending back to the input data. Each layer computes a function involving weights and biases, which are the parameters that require adjustment.
Training involves minimizing a loss function, let's call it $\mathcal{L}$, which measures the difference between the network's prediction and the actual target value. To minimize $\mathcal{L}$ using gradient descent (or its variants), we need to calculate the partial derivative of $\mathcal{L}$ with respect to every single weight and bias in the network: $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ and $\frac{\partial \mathcal{L}}{\partial b^{[l]}}$. Given the nested structure, calculating these directly seems daunting. This is where the chain rule becomes indispensable.
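Before looking at full networks, it helps to see the chain rule compute a gradient in the simplest possible case. The sketch below (an illustrative example, not from the text) uses a single neuron with a sigmoid activation and squared-error loss, and checks the chain-rule gradient against a finite-difference estimate:

```python
import math

# A single neuron with a sigmoid activation and squared-error loss:
#   z = w*x + b,  a = sigmoid(z),  loss = (a - y)^2
# The chain rule gives  dloss/dw = 2*(a - y) * sigmoid'(z) * x.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def dloss_dw(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a) * x  # sigmoid'(z) = a * (1 - a)

# Sanity check: the chain-rule gradient matches a central finite difference.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
analytic = dloss_dw(w, b, x, y)
print(abs(analytic - numeric) < 1e-8)  # the two gradients agree
```

Backpropagation is exactly this calculation, organized so it scales to millions of parameters.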
Backpropagation is not a new optimization algorithm; it's an efficient algorithm for computing gradients in a neural network. It systematically applies the multivariable chain rule to calculate all the necessary partial derivatives ($\frac{\partial \mathcal{L}}{\partial W^{[l]}_{jk}}$ and $\frac{\partial \mathcal{L}}{\partial b^{[l]}_{j}}$ for all layers $l$, neurons $j$, and input connections $k$) by working backward from the loss function.
Imagine the forward pass: input data flows through the network, layer by layer, undergoing linear transformations (multiplying by weights, adding biases) and non-linear activations, ultimately producing an output prediction $\hat{y}$ and then a loss value $\mathcal{L}$.
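The forward pass can be sketched concretely. The following minimal example (the layer sizes, seed, and loss choice are illustrative assumptions, not from the text) runs one input through a two-layer network and computes a loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: 3 inputs -> 4 hidden units (ReLU) -> 1 output (sigmoid).
# Column-vector convention: W[l] has shape (n_l, n_{l-1}).
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))   # one input example
y = np.array([[1.0]])             # its target label

# Forward pass: linear transformation + non-linear activation at each layer.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)          # ReLU activation
z2 = W2 @ a1 + b2
a2 = 1.0 / (1.0 + np.exp(-z2))    # sigmoid activation: the prediction y_hat

# Binary cross-entropy loss between prediction and target.
L = float(-(y * np.log(a2) + (1 - y) * np.log(1 - a2)))
print(a2.shape, L > 0)
```

Every intermediate quantity ($z^{[1]}$, $a^{[1]}$, $z^{[2]}$, $a^{[2]}$) is cached here, because the backward pass will reuse all of them.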
Backpropagation reverses this flow:
Start at the End: The process begins by calculating the derivative of the loss with respect to the final output activation of the network, $a^{[L]}$ (where $L$ denotes the final layer). This is usually straightforward, depending on the specific loss function and final activation function used. Let's denote $\frac{\partial \mathcal{L}}{\partial a^{[L]}}$ as $da^{[L]}$.
Gradient w.r.t. Pre-activation ($z^{[L]}$): Using the chain rule, we find the gradient with respect to the pre-activation linear output $z^{[L]}$ of the final layer. If $a^{[L]} = g(z^{[L]})$, where $g$ is the activation function, then: $\frac{\partial \mathcal{L}}{\partial z^{[L]}} = \frac{\partial \mathcal{L}}{\partial a^{[L]}} \cdot g'(z^{[L]})$. Let's denote $\frac{\partial \mathcal{L}}{\partial z^{[L]}}$ as $dz^{[L]}$. The term $g'(z^{[L]})$ is the derivative of the activation function evaluated at $z^{[L]}$.
Gradients w.r.t. Parameters ($W^{[L]}$, $b^{[L]}$): Now that we have $dz^{[L]}$, we can find the gradients for the weights and biases of the final layer. Since $z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$: $\frac{\partial \mathcal{L}}{\partial W^{[L]}} = dz^{[L]} \, (a^{[L-1]})^T$ and $\frac{\partial \mathcal{L}}{\partial b^{[L]}} = dz^{[L]}$. (Note: The calculation for $dW^{[L]}$ involves matrix/vector operations, resulting in a gradient matrix of the same shape as $W^{[L]}$. We often sum across the batch dimension for $db^{[L]}$.)
Propagate Gradient to Previous Layer: The truly clever part is propagating the error gradient backward. We need to find $\frac{\partial \mathcal{L}}{\partial a^{[L-1]}}$, the gradient with respect to the activations of the previous layer. Again, using the chain rule through $z^{[L]}$: $\frac{\partial \mathcal{L}}{\partial a^{[L-1]}} = (W^{[L]})^T dz^{[L]}$. Let's denote this $da^{[L-1]}$. This tells us how much the activations in layer $L-1$ contributed to the final loss $\mathcal{L}$.
Repeat for Layer $L-1$: Now we have $da^{[L-1]}$. We can repeat steps 2, 3, and 4 for layer $L-1$: compute $dz^{[L-1]} = da^{[L-1]} \cdot g'(z^{[L-1]})$, then $dW^{[L-1]} = dz^{[L-1]} \, (a^{[L-2]})^T$ and $db^{[L-1]} = dz^{[L-1]}$, and finally $da^{[L-2]} = (W^{[L-1]})^T dz^{[L-1]}$.
Continue Until Input Layer: This process repeats, stepping backward through each layer until we reach the input layer. At each step $l$, we use the incoming gradient $da^{[l]}$ (calculated from layer $l+1$) to compute $dz^{[l]}$, then the gradients for $W^{[l]}$ and $b^{[l]}$, and finally the gradient $da^{[l-1]}$ to pass back to the previous layer.
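The steps above can be sketched end to end for a small two-layer network. This is an illustrative implementation under assumed choices (ReLU then sigmoid activations, squared-error loss, a fixed random seed); the backward pass is verified against a finite-difference gradient at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two-layer network: 3 -> 4 (ReLU) -> 1 (sigmoid), squared-error loss.
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
x = rng.standard_normal((3, 1))
y = np.array([[1.0]])

# Forward pass, caching z and a for reuse in the backward pass.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
L = float(((a2 - y) ** 2).sum())

# Backward pass: the chain rule applied layer by layer.
da2 = 2.0 * (a2 - y)           # step 1: dL/da at the output
dz2 = da2 * a2 * (1.0 - a2)    # step 2: through sigmoid'(z2) = a2*(1-a2)
dW2 = dz2 @ a1.T               # step 3: gradients for W2 and b2
db2 = dz2
da1 = W2.T @ dz2               # step 4: propagate the gradient to layer 1
dz1 = da1 * (z1 > 0)           # repeat step 2 with ReLU'(z1)
dW1 = dz1 @ x.T                # repeat step 3 for W1 and b1
db1 = dz1

# Gradient check: perturb one weight and compare with a finite difference.
def loss_at(W1_):
    a1_ = np.maximum(0.0, W1_ @ x + b1)
    a2_ = sigmoid(W2 @ a1_ + b2)
    return float(((a2_ - y) ** 2).sum())

eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss_at(Wp) - loss_at(Wm)) / (2 * eps)
print(abs(numeric - dW1[0, 0]) < 1e-6)  # backprop matches the numerical gradient
```

Note how $a^{[1]}$ and $z^{[1]}$ from the forward pass are reused in the backward pass; this caching is what makes backpropagation efficient.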
This systematic backward application of the chain rule ensures that we compute the gradient of the loss with respect to every parameter in the network efficiently, reusing calculations where possible. The gradients $dW^{[l]}$ and $db^{[l]}$ for all layers are then used in an optimization algorithm like gradient descent to update the parameters: $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$,
where $\alpha$ is the learning rate.
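The update rule itself is simple enough to demonstrate on a one-parameter example. The sketch below (an illustrative quadratic loss, not from the text) repeatedly applies the update and shows the loss shrinking toward the minimum:

```python
# Gradient descent on a toy quadratic loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2*(w - 3). alpha is the learning rate.
alpha = 0.1
w = 0.0
losses = []
for _ in range(50):
    grad = 2.0 * (w - 3.0)
    w = w - alpha * grad          # the update rule from the text
    losses.append((w - 3.0) ** 2)

print(round(w, 3))  # converges toward the minimum at w = 3
```

In a real network the same subtraction is applied to every $W^{[l]}$ and $b^{[l]}$ using the gradients backpropagation produced.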
Simplified view of forward propagation (calculating loss) and backward propagation (calculating gradients) in a two-layer network. Backpropagation applies the chain rule layer by layer, starting from the loss, to compute the gradient of the loss with respect to each parameter ($dW^{[l]}$, $db^{[l]}$) and activation ($da^{[l]}$). The red nodes and arrows indicate the flow and calculation of gradients.
Understanding this backward flow and the application of the chain rule at each step is fundamental to comprehending how neural networks learn. The next section will look at how computational graphs can help visualize this process even more clearly.
© 2026 ApX Machine Learning