Computing derivatives for functions of a single variable is a foundational concept. For functions composed together, such as $f(g(x))$, the single-variable chain rule ($\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$) helps determine the overall rate of change. Machine learning models, especially neural networks, often involve functions with multiple inputs and outputs, creating more complex dependencies. A method to extend the chain rule is required to handle these multivariable scenarios.

Imagine a function $z$ that depends on two intermediate variables, $x$ and $y$. Furthermore, imagine that both $x$ and $y$ themselves depend on a single underlying variable, $t$. So we have the relationships $z = f(x, y)$, $x = g(t)$, and $y = h(t)$. How does $z$ change as $t$ changes?

Since $t$ affects $z$ through two paths (one via $x$, one via $y$), we need to account for the change propagated along both paths. The change in $z$ due to a small change in $t$ is the sum of the changes propagated through $x$ and $y$.

This leads to the multivariable chain rule for this case:

$$ \frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt} $$

Let's break this down:

- $\frac{\partial z}{\partial x}$: This is the partial derivative of $z$ with respect to $x$. It tells us how sensitive $z$ is to changes in $x$, holding $y$ constant. We use the partial derivative symbol ($\partial$) because $z$ is a function of more than one variable ($x$ and $y$).
- $\frac{dx}{dt}$: This is the ordinary derivative of $x$ with respect to $t$. It tells us how fast $x$ changes when $t$ changes. We use the ordinary derivative symbol ($d$) because $x$ is a function of only one variable ($t$).
- $\frac{\partial z}{\partial x} \frac{dx}{dt}$: This product represents the rate of change in $z$ caused by the change in $t$ flowing through $x$.
- Similarly, $\frac{\partial z}{\partial y} \frac{dy}{dt}$ represents the rate of change in $z$ caused by the change in $t$ flowing through $y$.

The sum combines the contributions from both paths. We can visualize these dependencies using a simple graph:

```dot
digraph G {
    rankdir=LR;
    node [shape=circle, style=filled, fillcolor="#a5d8ff", fontname="Arial"];
    edge [fontname="Arial"];
    t [fillcolor="#ffec99"];
    z [fillcolor="#b2f2bb"];
    t -> x [label=" dx/dt"];
    t -> y [label=" dy/dt"];
    x -> z [label=" ∂z/∂x"];
    y -> z [label=" ∂z/∂y"];
}
```

A dependency graph illustrating how changes in $t$ propagate through the intermediate variables $x$ and $y$ to affect the final output $z$. The labels indicate the relevant derivatives along each path.
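To make the two-path formula concrete, here is a minimal Python sketch, assuming one arbitrary choice of $f$, $g$, and $h$ (these particular functions are illustrative and not taken from the discussion above). It evaluates the chain-rule expression for $\frac{dz}{dt}$ and compares it against a central finite-difference estimate of the composite function.

```python
import math

# Arbitrary illustrative functions (an assumption for this sketch):
#   z = x^2 * y,   x = sin(t),   y = t^2
def x_of_t(t):
    return math.sin(t)

def y_of_t(t):
    return t ** 2

def z_of_xy(x, y):
    return x ** 2 * y

# Derivatives worked out by hand for these functions
def dz_dx(x, y):
    return 2 * x * y      # partial of z with respect to x, holding y constant

def dz_dy(x, y):
    return x ** 2         # partial of z with respect to y, holding x constant

def dx_dt(t):
    return math.cos(t)

def dy_dt(t):
    return 2 * t

t = 1.3
x, y = x_of_t(t), y_of_t(t)

# Multivariable chain rule: sum the contributions of the paths t -> x -> z and t -> y -> z
chain_rule = dz_dx(x, y) * dx_dt(t) + dz_dy(x, y) * dy_dt(t)

# Numerical check: central difference of the composed function z(t) = f(g(t), h(t))
h = 1e-6
numeric = (z_of_xy(x_of_t(t + h), y_of_t(t + h)) - z_of_xy(x_of_t(t - h), y_of_t(t - h))) / (2 * h)

print(f"chain rule: {chain_rule:.6f}, finite difference: {numeric:.6f}")
```

The two printed values should agree to several decimal places; the finite-difference result is only an approximation because of the small but nonzero step size.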
## Generalizing the Multivariable Chain Rule

Now, let's consider a more general situation common in machine learning. Suppose we have a final output variable $z$ that depends on several intermediate variables $u_1, u_2, ..., u_m$. Each of these intermediate variables, in turn, depends on several input variables $x_1, x_2, ..., x_n$. So, $z = f(u_1, u_2, ..., u_m)$ and $u_i = g_i(x_1, x_2, ..., x_n)$ for each $i$ from 1 to $m$.

We often want to know how the final output $z$ changes with respect to one specific input variable, say $x_j$. To find the partial derivative $\frac{\partial z}{\partial x_j}$, we need to consider all the paths through which $x_j$ can influence $z$: the input $x_j$ can affect any of the intermediate variables $u_1, u_2, ..., u_m$, and each of these can, in turn, affect $z$.

The general form of the multivariable chain rule sums up the influence along all these intermediate paths:

$$ \frac{\partial z}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial z}{\partial u_i} \frac{\partial u_i}{\partial x_j} $$

This formula states that the total rate of change of $z$ with respect to $x_j$ is the sum, over all intermediate variables $u_i$, of the rate of change of $z$ with respect to $u_i$ multiplied by the rate of change of $u_i$ with respect to $x_j$.

## Relevance to Neural Networks

This generalized chain rule is the mathematical engine behind backpropagation in neural networks. Think of $z$ as the network's loss function (e.g., mean squared error). The inputs $x_j$ could be the network's weights or biases in a particular layer, and the intermediate variables $u_i$ represent the activations or outputs of neurons in subsequent layers.

To train the network using gradient descent, we need to compute the gradient of the loss $z$ with respect to each weight and bias $x_j$. The network structure creates a deep chain of dependencies: the loss depends on the final layer's output, which depends on the previous layer's output and weights, and so on, all the way back to the specific weight $x_j$.

The multivariable chain rule provides the recipe for calculating these required gradients, $\frac{\partial z}{\partial x_j}$. Backpropagation is essentially an algorithm that applies this chain rule efficiently and recursively, layer by layer, starting from the output layer and working backward toward the input layer, computing the necessary partial derivatives at each step. Understanding this rule is therefore fundamental to understanding how neural networks learn. We will see exactly how this is applied in the context of backpropagation later in this chapter.
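As a small illustration of how the summation $\frac{\partial z}{\partial x_j} = \sum_{i} \frac{\partial z}{\partial u_i} \frac{\partial u_i}{\partial x_j}$ plays out in code, the Python sketch below assumes one arbitrary choice of the intermediate functions $g_i$ and the output function $f$ (these specific functions are illustrative, not taken from the text). The explicit loop over $i$ mirrors the sum in the formula, and a finite-difference check confirms the result; backpropagation performs the same kind of bookkeeping, just organized layer by layer.

```python
import math

# Arbitrary illustrative functions (an assumption for this sketch):
#   intermediates: u1 = x1 * x2,  u2 = x2 + x3^2
#   output:        z  = u1^2 + sin(u2)
def intermediates(x):
    x1, x2, x3 = x
    return [x1 * x2, x2 + x3 ** 2]

def output(u):
    u1, u2 = u
    return u1 ** 2 + math.sin(u2)

def dz_du(u):
    u1, u2 = u
    return [2 * u1, math.cos(u2)]     # partial of z with respect to each u_i

def du_dx(x):
    x1, x2, x3 = x
    # du[i][j] = partial of u_i with respect to x_j
    return [[x2, x1, 0.0],
            [0.0, 1.0, 2 * x3]]

x = [0.5, -1.2, 2.0]
u = intermediates(x)
dz = dz_du(u)
du = du_dx(x)

# General multivariable chain rule: for each input x_j, sum over all intermediate paths u_i
grad = [sum(dz[i] * du[i][j] for i in range(len(u))) for j in range(len(x))]

# Numerical check with central differences on the composed function z(x)
def z_of_x(x):
    return output(intermediates(x))

h = 1e-6
for j in range(len(x)):
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    numeric = (z_of_x(xp) - z_of_x(xm)) / (2 * h)
    print(f"x_{j + 1}: chain rule = {grad[j]:+.6f}, finite difference = {numeric:+.6f}")
```

Each row of the output pairs the chain-rule gradient for one input with its numerical estimate; the agreement is what lets frameworks build gradients for much deeper compositions out of exactly this summation.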