In the previous chapter, we saw how to compute derivatives for functions of a single variable. When dealing with functions composed together, like $f(g(x))$, the single-variable chain rule, $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ with $u = g(x)$ and $y = f(u)$, allows us to find the overall rate of change. However, machine learning models, especially neural networks, often involve functions with multiple inputs and outputs, creating more complex dependencies. We need a way to extend the chain rule to handle these multivariable scenarios.
Imagine a function $z$ that depends on two intermediate variables, $x$ and $y$. Furthermore, imagine that both $x$ and $y$ themselves depend on a single underlying variable, $t$. So we have relationships like $z = f(x, y)$, $x = g(t)$, and $y = h(t)$. How does $z$ change as $t$ changes?
Since t affects z through two paths (one via x, one via y), we need to account for the change propagated along both paths. The change in z due to a small change in t is the sum of the changes propagated through x and y.
This leads to the multivariable chain rule for this case:
$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$

Let's break this down: the first term, $\frac{\partial z}{\partial x}\frac{dx}{dt}$, is the change propagated along the path $t \to x \to z$, that is, how quickly $x$ responds to $t$, scaled by how sensitive $z$ is to $x$. The second term, $\frac{\partial z}{\partial y}\frac{dy}{dt}$, is the corresponding contribution from the path $t \to y \to z$. The derivatives with respect to $x$ and $y$ are partial derivatives because $z$ depends on both variables, while $\frac{dx}{dt}$ and $\frac{dy}{dt}$ are ordinary derivatives because $x$ and $y$ each depend only on $t$.
We can visualize these dependencies using a simple graph:
A dependency graph illustrating how changes in t propagate through intermediate variables x and y to affect the final output z. The labels indicate the relevant derivatives along each path.
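To see the formula in action, here is a small Python sketch. The specific functions ($z = xy$, $x = t^2$, $y = \sin t$) are illustrative assumptions chosen for this example, not taken from the text above; the chain-rule result is compared against a numerical central-difference estimate of $\frac{dz}{dt}$.

```python
import math

# Illustrative example functions (assumptions for this sketch):
#   z = f(x, y) = x * y,   x = g(t) = t**2,   y = h(t) = sin(t)

def z_of_t(t):
    x, y = t**2, math.sin(t)
    return x * y

def dz_dt_chain_rule(t):
    x, y = t**2, math.sin(t)
    dz_dx, dz_dy = y, x              # partial derivatives of z = x*y
    dx_dt, dy_dt = 2 * t, math.cos(t)
    # Sum the contributions from both paths: t -> x -> z and t -> y -> z
    return dz_dx * dx_dt + dz_dy * dy_dt

t, h = 1.3, 1e-6
numerical = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)  # central difference
print(dz_dt_chain_rule(t))  # analytic result via the chain rule
print(numerical)            # should agree to several decimal places
```

Both printed values should match closely, confirming that summing the two path contributions reproduces the total derivative.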
Now, let's consider a more general situation common in machine learning. Suppose we have a final output variable $z$ that depends on several intermediate variables $u_1, u_2, \dots, u_m$. Each of these intermediate variables, in turn, depends on several input variables $x_1, x_2, \dots, x_n$.
So, $z = f(u_1, u_2, \dots, u_m)$ and $u_i = g_i(x_1, x_2, \dots, x_n)$ for each $i$ from 1 to $m$.
We often want to know how the final output $z$ changes with respect to one specific input variable, say $x_j$. To find the partial derivative $\frac{\partial z}{\partial x_j}$, we need to consider all the paths through which $x_j$ can influence $z$. The input $x_j$ can affect any of the intermediate variables $u_1, u_2, \dots, u_m$, and each of these can, in turn, affect $z$.
The general form of the multivariable chain rule sums up the influence along all these intermediate paths:
$$\frac{\partial z}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial z}{\partial u_i}\frac{\partial u_i}{\partial x_j}$$

This formula states that the total rate of change of $z$ with respect to $x_j$ is the sum, over all intermediate variables $u_i$, of: (the rate of change of $z$ with respect to $u_i$) times (the rate of change of $u_i$ with respect to $x_j$).
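The same sum-over-paths computation can be sketched in Python. The specific functions below (two inputs, three intermediates) are made-up choices used only to illustrate the formula; the analytic sum is again checked against a central-difference estimate.

```python
import math

# Illustrative setup (assumptions for this sketch): n = 2 inputs, m = 3 intermediates.
#   u1 = x1 + x2,  u2 = x1 * x2,  u3 = sin(x1)
#   z  = u1**2 + u2 + 3*u3

def intermediates(x1, x2):
    return [x1 + x2, x1 * x2, math.sin(x1)]

def z_from_u(u):
    u1, u2, u3 = u
    return u1**2 + u2 + 3 * u3

def dz_dx1_chain_rule(x1, x2):
    u = intermediates(x1, x2)
    dz_du = [2 * u[0], 1.0, 3.0]        # partial z / partial u_i
    du_dx1 = [1.0, x2, math.cos(x1)]    # partial u_i / partial x1
    # Sum over all intermediate paths x1 -> u_i -> z
    return sum(a * b for a, b in zip(dz_du, du_dx1))

x1, x2, h = 0.7, -1.2, 1e-6
numerical = (z_from_u(intermediates(x1 + h, x2))
             - z_from_u(intermediates(x1 - h, x2))) / (2 * h)
print(dz_dx1_chain_rule(x1, x2))  # analytic sum over paths
print(numerical)                  # central-difference check
```

Each term in the loop corresponds to one path $x_1 \to u_i \to z$, exactly mirroring the summation in the formula above.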
This generalized chain rule is the mathematical engine behind backpropagation in neural networks. Think of $z$ as the network's loss function (e.g., mean squared error). The inputs $x_j$ could be the network's weights or biases in a particular layer. The intermediate variables $u_i$ represent the activations or outputs of neurons in subsequent layers.
To train the network using gradient descent, we need to compute the gradient of the loss $z$ with respect to each weight and bias $x_j$. The network structure creates a deep chain of dependencies: the loss depends on the final layer's output, which depends on the previous layer's output and weights, and so on, all the way back to the specific weight $x_j$.
The multivariable chain rule provides the recipe for calculating these required gradients, $\frac{\partial z}{\partial x_j}$. Backpropagation is essentially an algorithm that efficiently applies this chain rule recursively, layer by layer, starting from the output layer and working backward toward the input layer, computing the necessary partial derivatives at each step. Understanding this rule is therefore fundamental to understanding how neural networks learn. We will see exactly how this is applied in the context of backpropagation later in this chapter.
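As a preview, the following Python sketch runs backpropagation by hand on a tiny one-hidden-unit network with a squared-error loss. The architecture, parameter names, and numbers are illustrative assumptions, not a definitive implementation, but each line of the backward pass is one application of the chain rule described above.

```python
import math

# Minimal illustrative network (assumptions for this sketch):
#   a1    = tanh(w1 * x + b1)     hidden activation
#   y_hat = w2 * a1 + b2          network output
#   L     = (y_hat - y)**2        squared-error loss

def forward_and_backward(x, y, w1, b1, w2, b2):
    # Forward pass: compute intermediate values and the loss.
    z1 = w1 * x + b1
    a1 = math.tanh(z1)
    y_hat = w2 * a1 + b2
    loss = (y_hat - y)**2

    # Backward pass: apply the chain rule from the loss back toward the weights.
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * a1            # dL/dw2 = dL/dy_hat * dy_hat/dw2
    dL_db2 = dL_dyhat * 1.0
    dL_da1 = dL_dyhat * w2            # propagate to the hidden layer
    dL_dz1 = dL_da1 * (1 - a1**2)     # tanh'(z1) = 1 - tanh(z1)**2
    dL_dw1 = dL_dz1 * x               # dL/dw1 = dL/dz1 * dz1/dw1
    dL_db1 = dL_dz1 * 1.0
    return loss, {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}

loss, grads = forward_and_backward(x=0.5, y=1.0, w1=0.8, b1=0.1, w2=-1.2, b2=0.3)
print(loss, grads)
```

Notice how each gradient reuses quantities already computed further along the chain (such as `dL_dyhat` and `dL_da1`); this reuse of shared partial derivatives is what makes backpropagation efficient.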