Okay, we've established that backpropagation is essentially the chain rule applied systematically backward through the network layers. The ultimate goal of this process is to figure out how much a small change in each weight and bias affects the overall loss function L. This information, captured by the gradients ∂L/∂W and ∂L/∂b, is exactly what gradient descent needs to update the parameters and improve the model.
Let's zoom in on a specific layer, say layer l, within the network. During the forward pass, this layer took the activation a[l−1] from the previous layer (or the input features if l=1) and computed its own output activation a[l] through two steps:

z[l] = W[l]a[l−1] + b[l]
a[l] = g[l](z[l])

Here, W[l] and b[l] are the weight matrix and bias vector for layer l, and g[l] is its activation function.
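As a concrete sketch, the two forward steps can be written in a few lines of NumPy. The function and variable names here are illustrative, not from the text; shapes follow the usual convention of one column per example:

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    """Forward pass for layer l: linear step, then activation.

    A_prev: activations from layer l-1, shape (n_prev, m)
    W:      weights for layer l,        shape (n_l, n_prev)
    b:      biases for layer l,         shape (n_l, 1)
    g:      elementwise activation function
    """
    Z = W @ A_prev + b   # z[l] = W[l] a[l-1] + b[l]
    A = g(Z)             # a[l] = g[l](z[l])
    return Z, A          # Z is cached for use in the backward pass

# Illustrative usage with a ReLU activation and a batch of m = 4 examples
relu = lambda Z: np.maximum(0.0, Z)
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))  # 3 units in layer l-1, 4 examples
W = rng.standard_normal((2, 3))       # 2 units in layer l
b = np.zeros((2, 1))
Z, A = layer_forward(A_prev, W, b, relu)
print(A.shape)  # (2, 4)
```

Caching Z during the forward pass matters: the backward pass will need it to evaluate g′[l](z[l]).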
The backpropagation algorithm works layer by layer, starting from the final layer and moving toward the input. When we reach layer l during the backward pass, we assume we have already calculated ∂L/∂a[l], the gradient of the loss with respect to the output activation of this layer. Our task now is to use this information to compute:

- ∂L/∂W[l] and ∂L/∂b[l], the gradients needed to update this layer's parameters, and
- ∂L/∂a[l−1], the gradient to pass backward to the previous layer.
Let's see how the chain rule helps us compute these quantities.
First, we need the gradient of the loss with respect to the pre-activation z[l]. Since L depends on a[l], and a[l] depends directly on z[l] via the activation function g[l], we apply the chain rule:
∂L/∂z[l] = ∂L/∂a[l] · ∂a[l]/∂z[l]

Recall that a[l] = g[l](z[l]). Therefore, the second term is simply the derivative of the activation function, g′[l](z[l]).
∂L/∂z[l] = ∂L/∂a[l] · g′[l](z[l])

(The product here is elementwise, since g[l] is applied elementwise.) This gradient, ∂L/∂z[l], represents how the loss changes with respect to the linear combination computed just before the activation function in layer l. It is a significant intermediate value, often denoted δ[l] or, in code, dZ[l]. It essentially carries the error signal backward through the activation function's derivative.
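In code, this step is a single elementwise product of the incoming gradient with g′ evaluated at the cached pre-activation. A minimal sketch for a sigmoid activation (function names are illustrative, not from the text):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_backward(dA, Z):
    """Compute dZ = dA * g'(Z) for a sigmoid activation.

    Uses g'(z) = sigmoid(z) * (1 - sigmoid(z)); dA and Z share
    the shape (n_l, m), and so does the returned dZ.
    """
    s = sigmoid(Z)
    return dA * s * (1.0 - s)   # elementwise product

Z = np.array([[0.0, 2.0]])
dA = np.array([[1.0, 1.0]])
dZ = sigmoid_backward(dA, Z)
print(dZ)  # first entry is 0.25, since g'(0) = 0.25 for the sigmoid
```

For a ReLU the same function would instead multiply dA by an indicator `(Z > 0)`, since ReLU's derivative is 0 or 1.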
Now that we have ∂L/∂z[l], we can find the gradients for the parameters W[l] and b[l]. Remember that z[l] = W[l]a[l−1] + b[l].
For Weights (W[l]): The loss L depends on z[l], which in turn depends on W[l]. Applying the chain rule:
∂L/∂W[l] = ∂L/∂z[l] · ∂z[l]/∂W[l]

We need the partial derivative of z[l] with respect to W[l]. Looking at z[l] = W[l]a[l−1] + b[l], the derivative ∂z[l]/∂W[l] turns out to be a[l−1] (specifically, its transpose in matrix notation, due to the rules of matrix calculus). Intuitively, how much z[l] changes when W[l] changes depends on the input activation a[l−1] it was multiplied by.
So, the gradient for the weights is:
∂L/∂W[l] = ∂L/∂z[l] (a[l−1])T

In implementations using vectorized operations (handling a batch of m examples at once), this is often calculated as:

dW[l] = (1/m) dZ[l] (A[l−1])T

where dZ[l] represents ∂L/∂Z[l] (the matrix form of ∂L/∂z[l] across the batch), A[l−1] contains the activations from the previous layer for all examples in the batch, and dW[l] is the resulting gradient matrix for the weights. The 1/m factor averages the gradient over the batch.
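This vectorized formula is a single matrix product in NumPy. A small sketch with assumed shapes (2 neurons in layer l, 3 in layer l−1, batch of m = 4):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4                                  # batch size
dZ = rng.standard_normal((2, m))       # dL/dZ[l],  shape (n_l, m)
A_prev = rng.standard_normal((3, m))   # A[l-1],    shape (n_prev, m)

# dW[l] = (1/m) dZ[l] (A[l-1])^T  -- averages the per-example outer products
dW = (1.0 / m) * dZ @ A_prev.T
print(dW.shape)  # (2, 3), matching W[l]
```

Note that dW has exactly the same shape as W[l], which is what gradient descent requires for the update W[l] := W[l] − α dW[l].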
For Biases (b[l]): Similarly, L depends on z[l], which depends on b[l].
∂L/∂b[l] = ∂L/∂z[l] · ∂z[l]/∂b[l]

The derivative ∂z[l]/∂b[l] is simply 1, because z[l] = W[l]a[l−1] + b[l] changes by exactly the same amount as b[l] changes (holding the other variables constant).

∂L/∂b[l] = ∂L/∂z[l] · 1 = ∂L/∂z[l]

So the gradient for the bias is just the gradient with respect to the pre-activation z[l]. When working with batches, the gradient dZ[l] will typically have dimensions (number of neurons in layer l, number of examples m). The bias b[l] is usually a single vector applied to all examples. Therefore, the gradient db[l] is computed by summing (or averaging) dZ[l] across the batch dimension:

db[l] = (1/m) ∑_{i=1}^{m} (dZ[l])(i)

where (dZ[l])(i) represents the i-th column (example) of the dZ[l] matrix.
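The sum over columns is one reduction in NumPy; `keepdims=True` preserves the (n_l, 1) column-vector shape of b[l]. A sketch with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
dZ = rng.standard_normal((2, m))   # dL/dZ[l], shape (n_l, m)

# db[l] = (1/m) * sum over the m columns (examples) of dZ[l]
db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)
print(db.shape)  # (2, 1), matching b[l]
```

Without `keepdims=True` the result would flatten to shape (2,), and the subsequent update b[l] := b[l] − α db[l] would broadcast incorrectly.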
Finally, to continue the backward pass, we need to compute ∂L/∂a[l−1]. This tells layer l−1 how its output activations affected the overall loss. Again, we use the chain rule, noting that L depends on z[l], which depends on a[l−1].
∂L/∂a[l−1] = ∂L/∂z[l] · ∂z[l]/∂a[l−1]

From z[l] = W[l]a[l−1] + b[l], the derivative ∂z[l]/∂a[l−1] is the weight matrix W[l] (specifically, its transpose according to matrix calculus rules). Intuitively, the influence of a[l−1] on z[l] is determined by the weights W[l] it was multiplied by.
∂L/∂a[l−1] = (W[l])T ∂L/∂z[l]

In vectorized form:

dA[l−1] = (W[l])T dZ[l]

This dA[l−1] becomes the input gradient ∂L/∂a[l−1] for the next step of backpropagation when we process layer l−1.
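Again a single matrix product, sketched with assumed shapes; note how the transpose makes the dimensions line up so the result matches A[l−1]:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
W = rng.standard_normal((2, 3))    # W[l],     shape (n_l, n_prev)
dZ = rng.standard_normal((2, m))   # dL/dZ[l], shape (n_l, m)

# dA[l-1] = (W[l])^T dZ[l]: (n_prev, n_l) @ (n_l, m) -> (n_prev, m)
dA_prev = W.T @ dZ
print(dA_prev.shape)  # (3, 4), matching A[l-1]
```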
During the backward pass for layer l, assuming we have dA[l] = ∂L/∂A[l], we compute:

1. dZ[l] = dA[l] ⊙ g′[l](Z[l]) (elementwise product)
2. dW[l] = (1/m) dZ[l] (A[l−1])T
3. db[l] = (1/m) ∑_{i=1}^{m} (dZ[l])(i)
4. dA[l−1] = (W[l])T dZ[l]
This computed dA[l−1] is then passed backward to layer l−1, and the process repeats until we reach the input layer. This systematic application of the chain rule allows us to efficiently find the gradients for all weights and biases in the network. The diagram below illustrates this flow for a single layer.
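Putting the pieces together, a hedged sketch of the full backward pass for one layer. A ReLU activation is assumed here purely for illustration, and all names are this sketch's own, not from the text:

```python
import numpy as np

def layer_backward(dA, Z, A_prev, W):
    """Backward pass for layer l with a ReLU activation.

    dA:     dL/dA[l],                              shape (n_l, m)
    Z:      cached pre-activation from forward,    shape (n_l, m)
    A_prev: cached activations A[l-1],             shape (n_prev, m)
    W:      weights W[l],                          shape (n_l, n_prev)
    Returns (dW, db, dA_prev).
    """
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                                   # dA * g'(Z); ReLU' is 0 or 1
    dW = (1.0 / m) * dZ @ A_prev.T                      # gradient for W[l]
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)  # gradient for b[l]
    dA_prev = W.T @ dZ                                  # passed back to layer l-1
    return dW, db, dA_prev
```

A training loop would call this once per layer, from the output layer down to layer 1, feeding each returned dA_prev into the previous layer's call.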
Gradient computation flow for layer l, showing vectorized operations. The backward pass uses the incoming gradient dA[l] to compute dZ[l], which then allows calculation of the parameter gradients dW[l] and db[l] (yellow) and the gradient dA[l−1] (red) needed for the previous layer.
© 2025 ApX Machine Learning