Computing derivatives for functions of a single variable is a foundational concept. For functions composed together, such as $y = f(g(x))$, the single-variable chain rule ($\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$) helps determine the overall rate of change. Machine learning models, especially neural networks, often involve functions with multiple inputs and outputs, creating more complex dependencies. A method to extend the chain rule is required to handle these multivariable scenarios.
Imagine a function $z = f(u, v)$ that depends on two intermediate variables, $u$ and $v$. Furthermore, imagine that both $u$ and $v$ themselves depend on a single underlying variable, $x$. So we have relationships like $u = g(x)$, $v = h(x)$, and $z = f(u, v)$. How does $z$ change as $x$ changes?
Since $x$ affects $z$ through two paths (one via $u$, one via $v$), we need to account for the change propagated along both paths. The change in $z$ due to a small change in $x$ is the sum of the changes propagated through $u$ and $v$.
This leads to the multivariable chain rule for this case:

$$\frac{dz}{dx} = \frac{\partial z}{\partial u} \frac{du}{dx} + \frac{\partial z}{\partial v} \frac{dv}{dx}$$

Let's break this down:

- $\frac{dz}{dx}$ is the total rate of change of $z$ with respect to $x$, accounting for both paths.
- $\frac{\partial z}{\partial u}$ and $\frac{\partial z}{\partial v}$ are the partial derivatives of $z$ with respect to each intermediate variable, holding the other one fixed.
- $\frac{du}{dx}$ and $\frac{dv}{dx}$ are ordinary derivatives, since $u$ and $v$ each depend only on $x$.
We can visualize these dependencies using a simple graph:
A dependency graph illustrating how changes in $x$ propagate through the intermediate variables $u$ and $v$ to affect the final output $z$. The labels indicate the relevant derivatives along each path.
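To make this concrete, here is a small numerical check of the two-path rule. The specific functions are hypothetical choices for illustration only ($f(u, v) = u \cdot v$, $u = x^2$, $v = \sin(x)$); the chain-rule result should agree with a finite-difference estimate.

```python
import math

# Hypothetical example functions: z = f(u, v) = u * v, with u = x^2 and v = sin(x).
def u(x): return x ** 2
def v(x): return math.sin(x)
def z(x): return u(x) * v(x)

def dz_dx_chain_rule(x):
    # dz/du = v, dz/dv = u, du/dx = 2x, dv/dx = cos(x)
    return v(x) * (2 * x) + u(x) * math.cos(x)

def dz_dx_numeric(x, h=1e-6):
    # Central finite-difference approximation, used only to verify the chain-rule value
    return (z(x + h) - z(x - h)) / (2 * h)

x0 = 1.3
print(dz_dx_chain_rule(x0))  # analytical result via the two-path chain rule
print(dz_dx_numeric(x0))     # should match to several decimal places
```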
Now, let's consider a more general situation common in machine learning. Suppose we have a final output variable $z$ that depends on several intermediate variables $u_1, u_2, \dots, u_m$. Each of these intermediate variables, in turn, depends on several input variables $x_1, x_2, \dots, x_n$.
So, $z = f(u_1, u_2, \dots, u_m)$ and $u_j = g_j(x_1, x_2, \dots, x_n)$ for each $j$ from 1 to $m$.
We often want to know how the final output $z$ changes with respect to one specific input variable, say $x_i$. To find the partial derivative $\frac{\partial z}{\partial x_i}$, we need to consider all the paths through which $x_i$ can influence $z$. The input $x_i$ can affect any of the intermediate variables $u_j$, and each of these can, in turn, affect $z$.
The general form of the multivariable chain rule sums up the influence along all these intermediate paths:

$$\frac{\partial z}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial z}{\partial u_j} \frac{\partial u_j}{\partial x_i}$$
This formula states that the total rate of change of $z$ with respect to $x_i$ is the sum, over all intermediate variables $u_j$, of: $\frac{\partial z}{\partial u_j}$ (the rate of change of $z$ with respect to $u_j$) times $\frac{\partial u_j}{\partial x_i}$ (the rate of change of $u_j$ with respect to $x_i$).
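As a quick sketch of how this sum is evaluated in practice, the snippet below uses made-up values for the partials $\frac{\partial z}{\partial u_j}$ and $\frac{\partial u_j}{\partial x_i}$ (the shapes and numbers are assumptions, not values from the text) and computes $\frac{\partial z}{\partial x_i}$ for every input at once.

```python
import numpy as np

# Illustrative numbers only: m = 3 intermediate variables, n = 2 inputs.
dz_du = np.array([0.5, -1.0, 2.0])     # dz/du_j for j = 1..m
du_dx = np.array([[1.0, 0.0],          # du_j/dx_i, shape (m, n)
                  [0.3, 0.7],
                  [-2.0, 1.5]])

# dz/dx_i = sum over j of (dz/du_j) * (du_j/dx_i), computed for all i at once
dz_dx = dz_du @ du_dx
print(dz_dx)   # one partial derivative per input x_i
```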
This generalized chain rule is the mathematical engine behind backpropagation in neural networks. Think of $z$ as the network's loss function (e.g., mean squared error). The inputs $x_i$ could be the network's weights or biases in a particular layer. The intermediate variables $u_j$ represent the activations or outputs of neurons in subsequent layers.
To train the network using gradient descent, we need to compute the gradient of the loss with respect to each weight $w$ and bias $b$. The network structure creates a deep chain of dependencies: the loss depends on the final layer's output, which depends on the previous layer's output and weights, and so on, all the way back to the specific weight $w$.
The multivariable chain rule provides the recipe for calculating these required gradients, $\frac{\partial z}{\partial w}$. Backpropagation is essentially an algorithm that efficiently applies this chain rule recursively, layer by layer, starting from the output layer and working backward toward the input layer, computing the necessary partial derivatives at each step. Understanding this rule is therefore fundamental to understanding how neural networks learn. We will see exactly how this is applied in the context of backpropagation later in this chapter.
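To connect the rule to backpropagation concretely, here is a minimal sketch of a backward pass through a hypothetical network with one input, one hidden unit, a sigmoid activation, and a squared-error loss. The architecture, parameter values, and loss choice are assumptions for illustration; the gradient of the loss with respect to the first-layer weight is obtained by multiplying the partial derivatives along the dependency chain.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical tiny network: one input, one hidden unit, one output, squared-error loss.
x, y = 0.5, 1.0                        # input and target (made-up values)
w1, b1, w2, b2 = 0.8, 0.1, -0.4, 0.2   # made-up parameters

# Forward pass
s1 = w1 * x + b1;  h = sigmoid(s1)
s2 = w2 * h + b2;  y_hat = sigmoid(s2)
loss = (y_hat - y) ** 2

# Backward pass: apply the chain rule link by link, from the loss back toward w1
dloss_dyhat = 2 * (y_hat - y)
dyhat_ds2 = y_hat * (1 - y_hat)    # derivative of sigmoid at s2
ds2_dh = w2
dh_ds1 = h * (1 - h)               # derivative of sigmoid at s1
ds1_dw1 = x

dloss_dw1 = dloss_dyhat * dyhat_ds2 * ds2_dh * dh_ds1 * ds1_dw1
print(dloss_dw1)   # gradient used by gradient descent to update w1
```

In a real network a weight typically influences the loss through many such chains, and backpropagation sums the products along all of them, exactly as the general formula above prescribes.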