Okay, let's think about a typical neural network, even a relatively simple one. You might have an input layer, one or more hidden layers, and an output layer. Data flows through this network, undergoing transformations at each step. How can we describe this mathematically?
Imagine a single neuron first. It takes some inputs, computes a weighted sum (plus a bias), and then passes that sum through an activation function. This looks like:

$$a = g(\mathbf{w}^\top \mathbf{x} + b)$$

Here, $g$ is the activation function (such as Sigmoid or ReLU). The output $a$ is a function of the inputs $\mathbf{x}$.
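As a concrete sketch, the single-neuron computation can be written in a few lines of NumPy. The specific weights, bias, and the choice of ReLU as the activation are illustrative assumptions:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # Weighted sum plus bias, then activation: a = g(w.x + b)
    z = np.dot(w, x) + b
    return relu(z)

x = np.array([1.0, 2.0])    # example inputs
w = np.array([0.5, -0.25])  # example weights
b = 0.1                     # example bias
a = neuron(x, w, b)         # g(0.5*1 - 0.25*2 + 0.1) = g(0.1) = 0.1
```

Swapping `relu` for a sigmoid or any other activation changes $g$ but not the structure: a linear step followed by a non-linear one.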
Now, consider a network with multiple layers. The output activations from one layer, $a^{(l)}$, become the inputs for the next layer, $x^{(l+1)}$. Let's trace the path for a simple network with one hidden layer:

$$z^{(1)} = W^{(1)}x + b^{(1)}, \qquad a^{(1)} = g^{(1)}(z^{(1)})$$
$$z^{(2)} = W^{(2)}a^{(1)} + b^{(2)}, \qquad \hat{y} = g^{(2)}(z^{(2)})$$
Look closely at the final output $\hat{y}$. We can write it out by substituting the intermediate steps:
$$\begin{aligned}
\hat{y} &= g^{(2)}(z^{(2)}) \\
&= g^{(2)}(W^{(2)}a^{(1)} + b^{(2)}) \\
&= g^{(2)}(W^{(2)}g^{(1)}(z^{(1)}) + b^{(2)}) \\
&= g^{(2)}(W^{(2)}g^{(1)}(W^{(1)}x + b^{(1)}) + b^{(2)})
\end{aligned}$$
This equation makes it clear: a neural network is a mathematical representation of a composite function. It's a function ($g^{(2)}$) applied to the result of another calculation ($W^{(2)}(\cdot) + b^{(2)}$), which itself involves the output of another function ($g^{(1)}$), and so on, all the way back to the original input $x$.
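The nested expression above translates almost verbatim into code. This is a minimal sketch, where the layer sizes, the random weights, and the choice of ReLU/identity activations are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 3 inputs, 4 hidden units, 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

g1 = lambda z: np.maximum(0.0, z)  # hidden activation (ReLU)
g2 = lambda z: z                   # output activation (identity)

def forward(x):
    # y_hat = g2(W2 @ g1(W1 @ x + b1) + b2): one nested expression,
    # mirroring the composite-function equation exactly
    return g2(W2 @ g1(W1 @ x + b1) + b2)

x = np.array([1.0, -2.0, 0.5])
y_hat = forward(x)
```

Notice that `forward` is literally one expression: a function of a function of a function of `x`.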
Think of it like nesting Russian dolls. The final output depends on the output layer's calculation, which depends on the hidden layer's output, which depends on the hidden layer's calculation, which depends on the input $x$ and the parameters $W^{(1)}, b^{(1)}$.
We can visualize this flow:
A simple feedforward neural network viewed as a sequence of functional transformations. Each layer applies a linear transformation followed by a non-linear activation. The output of one layer feeds into the next, creating a composite function.
Why is this perspective important? Because when we train a neural network, we usually have a loss function (or cost function), let's call it $L$, that measures how far the network's output $\hat{y}$ is from the true target value $y$. The loss $L$ depends on $\hat{y}$. And $\hat{y}$, as we just saw, is a complex function involving all the weights ($W^{(1)}, W^{(2)}, \dots$) and biases ($b^{(1)}, b^{(2)}, \dots$) in the network.
To train the network using gradient descent (which we discussed in the previous chapter), we need to calculate the gradient of the loss $L$ with respect to every single weight and bias in the network. For example, we need $\frac{\partial L}{\partial W^{(1)}}$.
Because $L$ depends on $\hat{y}$, which depends on $a^{(1)}$, which depends on $z^{(1)}$, which finally depends on $W^{(1)}$, we have a chain of dependencies. Calculating this derivative requires navigating through this nested structure. This is precisely where the chain rule becomes indispensable. It provides the mathematical machinery to compute these gradients efficiently by breaking down the complex derivative into a product of simpler derivatives at each step of the network's computation. This systematic application of the chain rule in neural networks is the core idea behind the backpropagation algorithm, which we will explore next.
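To make the chain of dependencies concrete, here is a minimal sketch of computing $\frac{\partial L}{\partial W^{(1)}}$ by hand, one chain-rule factor per computation step, with a finite-difference check on one entry. The squared-error loss, sigmoid hidden activation, identity output, and random parameters are all illustrative assumptions, not the only possible choices:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x, y = rng.normal(size=3), np.array([0.7])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W1):
    # Forward pass: L depends on y_hat, which depends on a1, z1, and W1
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2              # identity output activation
    return 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: one chain-rule factor per step of the computation
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2
dL_dyhat = y_hat - y                  # dL/dy_hat for squared error
dL_da1 = W2.T @ dL_dyhat              # dL/da1 = W2^T dL/dy_hat
dL_dz1 = dL_da1 * a1 * (1.0 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
dL_dW1 = np.outer(dL_dz1, x)          # dL/dW1 = dL/dz1 . x^T

# Verify one entry numerically with a central difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
assert abs(numeric - dL_dW1[0, 0]) < 1e-6
```

Each line of the backward pass corresponds to one link in the dependency chain $L \to \hat{y} \to a^{(1)} \to z^{(1)} \to W^{(1)}$; backpropagation is exactly this bookkeeping, organized systematically for every parameter.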
© 2025 ApX Machine Learning