Calculating a single gradient manually for a very simple network provides a clear illustration of the underlying mechanics. This direct approach helps in understanding the central role of the chain rule.

A note for this section: don't worry if you don't follow every step on a first read.

### Single gradient calculation

Let's take a single neuron with an input $x$, a weight $w$, a bias $b$, and an activation function $\sigma$. The loss is $L$. The calculation is:

- Pre-activation: $z = wx + b$
- Activation: $a = \sigma(z)$
- Loss: $L = \text{Loss}(a, y)$

To find the gradient of the loss with respect to the weight, $\frac{\partial L}{\partial w}$, we need to see how a small change in $w$ affects the loss $L$. The chain rule says we can find this by multiplying the rates of change at each step:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$$

Let's break down each term:

- $\frac{\partial L}{\partial a}$: how the loss changes when the neuron's output activation $a$ changes. This is the first thing we'd calculate, and its exact form depends on the chosen loss function.
- $\frac{\partial a}{\partial z}$: the derivative of the activation function, $\sigma'(z)$. It tells us how the activation changes based on the pre-activation value.
- $\frac{\partial z}{\partial w}$: the derivative of $wx + b$ with respect to $w$, which is simply $x$.

So, to find the gradient for the weight $w$, we calculate each of these parts and multiply them. For a multi-layered network, this chain of derivatives is much longer, connecting a parameter deep inside the network all the way to the final loss. The backpropagation algorithm is simply a structured and efficient way to perform this long chain rule calculation for all parameters at once.

### The Power of the Chain Rule

Fundamentally, a neural network is a series of nested functions: the output of one layer becomes the input to the next. To find how a change in a parameter deep inside the network affects the final loss, we need to apply the chain rule repeatedly.

Consider a very simple network calculation:

1. Input: $x$
2. Hidden layer pre-activation: $z^{(1)} = w^{(1)}x + b^{(1)}$
3. Hidden layer activation: $a^{(1)} = \sigma(z^{(1)})$ (where $\sigma$ is an activation function)
4. Output layer pre-activation: $z^{(2)} = w^{(2)}a^{(1)} + b^{(2)}$
5. Output layer activation (prediction): $a^{(2)} = \sigma(z^{(2)})$
6. Loss: $L = \text{Loss}(a^{(2)}, y)$ (where $y$ is the true label)

If we want to find the gradient of the loss $L$ with respect to a weight in the first layer, say $w^{(1)}$, we need to trace its influence forward to the loss: $w^{(1)}$ affects $z^{(1)}$, which affects $a^{(1)}$, which affects $z^{(2)}$, which affects $a^{(2)}$, which finally affects $L$. The chain rule allows us to break this down:

$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

Backpropagation organizes this calculation efficiently. It calculates terms like $\frac{\partial L}{\partial a^{(2)}}$ and $\frac{\partial L}{\partial z^{(2)}}$ first, and then reuses them to compute gradients for earlier layers.
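To make this concrete, here is a minimal Python sketch that multiplies out each factor of the five-term chain for the tiny network above. It assumes a sigmoid activation and a squared-error loss $L = \tfrac{1}{2}(a^{(2)} - y)^2$; those choices, and the example numbers, are illustrative assumptions rather than part of the algorithm. A finite-difference check at the end confirms the product of the factors matches a direct numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, y, w1, b1, w2, b2):
    """Forward pass for the tiny two-layer scalar network above."""
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    loss = 0.5 * (a2 - y) ** 2          # assumed squared-error loss
    return z1, a1, z2, a2, loss

# Arbitrary example values.
x, y = 0.7, 1.0
w1, b1, w2, b2 = 0.5, -0.2, 1.3, 0.1

z1, a1, z2, a2, loss = forward(x, y, w1, b1, w2, b2)

# Each factor of the chain rule, written out explicitly.
dL_da2 = a2 - y                 # dL/da(2) for squared-error loss
da2_dz2 = sigmoid_prime(z2)     # da(2)/dz(2)
dz2_da1 = w2                    # dz(2)/da(1)
da1_dz1 = sigmoid_prime(z1)     # da(1)/dz(1)
dz1_dw1 = x                     # dz(1)/dw(1)

dL_dw1 = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Sanity check with a central finite difference on w1.
eps = 1e-6
loss_plus = forward(x, y, w1 + eps, b1, w2, b2)[-1]
loss_minus = forward(x, y, w1 - eps, b1, w2, b2)[-1]
numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"chain rule: {dL_dw1:.8f}, finite difference: {numeric:.8f}")
```

The two printed numbers agree to several decimal places, which is exactly the point: the long product of local derivatives really does equal the sensitivity of the loss to that one weight.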
### The Backward Pass Explained

The algorithm works in two main phases:

**Forward Pass:** Input data is fed through the network layer by layer, calculating pre-activations ($z^{(l)}$) and activations ($a^{(l)}$) at each step, ultimately producing the final output prediction $a^{(L)}$. During this pass, we store the intermediate values ($z^{(l)}$ and $a^{(l)}$ for all layers $l$) as they will be needed for the backward pass. The final loss $L$ is calculated using the prediction $a^{(L)}$ and the true target $y$.

**Backward Pass:** This pass computes the gradients, starting at the output and working toward the input (a compact code sketch of both phases follows below).

**Output layer:** Calculate the gradient of the loss with respect to the output layer's activation, $\frac{\partial L}{\partial a^{(L)}}$. Then, using the chain rule, calculate the gradient with respect to the output layer's pre-activation $z^{(L)}$:

$$\delta^{(L)} = \frac{\partial L}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \sigma'(z^{(L)})$$

Here, $\sigma'(z^{(L)})$ is the derivative of the activation function used in the output layer, evaluated at $z^{(L)}$. This term $\delta^{(L)}$ represents the "error signal" at the output layer's pre-activation. Now, calculate the gradients for the output layer's weights $W^{(L)}$ and biases $b^{(L)}$:

$$\frac{\partial L}{\partial W^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial W^{(L)}} = \delta^{(L)} (a^{(L-1)})^T$$

$$\frac{\partial L}{\partial b^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial b^{(L)}} = \delta^{(L)}$$

(Note: the exact forms depend on whether we are working with scalars or with vectors and matrices. $a^{(L-1)}$ represents the activations from the previous layer feeding into the output layer; the transpose ensures dimensions match for matrix multiplication.)

**Hidden layers** (working backward from $l = L-1$ down to $l = 1$): For each hidden layer $l$, calculate its error signal $\delta^{(l)}$ based on the error signal $\delta^{(l+1)}$ from the next layer (the one closer to the output):

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$$

Here, $(W^{(l+1)})^T$ is the transpose of the weight matrix connecting layer $l$ to layer $l+1$. This step effectively propagates the error gradient backward through the network weights. The term $\sigma'(z^{(l)})$ is the derivative of the activation function of the current hidden layer $l$, and $\odot$ denotes element-wise multiplication (the Hadamard product). Once $\delta^{(l)}$ is known, calculate the gradients for layer $l$'s weights $W^{(l)}$ and biases $b^{(l)}$ just as for the output layer:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$$

$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

where $a^{(l-1)}$ are the activations from the layer feeding into layer $l$ (for the first hidden layer, this is the input data $x$).

This process continues backward until gradients for all parameters have been computed.
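Here is a minimal NumPy sketch of both passes for a small fully connected network. It assumes sigmoid activations in every layer and a squared-error loss, and the layer sizes and helper names (`forward_pass`, `backward_pass`) are hypothetical choices for this example rather than anything prescribed by the algorithm. The backward loop follows the $\delta^{(l)}$ recursion above directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_pass(x, weights, biases):
    """Run the forward pass, storing every z and a; a_vals[0] is the input itself."""
    a_vals, z_vals = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        z_vals.append(z)
        a_vals.append(a)
    return z_vals, a_vals

def backward_pass(y, weights, z_vals, a_vals):
    """Return dL/dW and dL/db for every layer (squared-error loss assumed)."""
    num_layers = len(weights)
    grads_W = [None] * num_layers
    grads_b = [None] * num_layers

    # Output layer: delta^(L) = dL/da^(L) * sigma'(z^(L)).
    delta = (a_vals[-1] - y) * sigmoid_prime(z_vals[-1])
    grads_W[-1] = delta @ a_vals[-2].T
    grads_b[-1] = delta

    # Hidden layers: delta^(l) = (W^(l+1).T @ delta^(l+1)) * sigma'(z^(l)), elementwise.
    for l in range(num_layers - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(z_vals[l])
        grads_W[l] = delta @ a_vals[l].T     # a_vals[l] is a^(l-1) for this layer
        grads_b[l] = delta

    return grads_W, grads_b

# Tiny example: 2 inputs -> 3 hidden units -> 1 output, using column vectors.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros((3, 1)), np.zeros((1, 1))]
x, y = np.array([[0.5], [-1.2]]), np.array([[1.0]])

z_vals, a_vals = forward_pass(x, weights, biases)
grads_W, grads_b = backward_pass(y, weights, z_vals, a_vals)
print([g.shape for g in grads_W])  # gradient shapes match the weight shapes
```

Activations are kept as column vectors so that $\delta^{(l)} (a^{(l-1)})^T$ is an outer product with exactly the same shape as $W^{(l)}$, which is why the printed gradient shapes match the weight shapes.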
### Visualizing the Flow

We can think of the computation as flowing forward to get the loss, and then backward to get the gradients.

```dot
digraph G {
    rankdir=LR;
    node [shape=circle, style=filled, color="#ced4da"];
    edge [color="#868e96"];

    subgraph cluster_input {
        label = "Input Layer (l=0)"; style=dashed; color="#adb5bd";
        x [label="x", color="#a5d8ff"];
    }
    subgraph cluster_hidden {
        label = "Hidden Layer (l=1)"; style=dashed; color="#adb5bd";
        z1 [label="z(1)", color="#b2f2bb"];
        a1 [label="a(1)", color="#8ce99a"];
        W1 [label="W(1), b(1)", shape=box, style=filled, color="#e9ecef", fixedsize=true, width=1, height=0.5];
        z1 -> a1 [label=" σ(.)", color="#69db7c"];
    }
    subgraph cluster_output {
        label = "Output Layer (l=2)"; style=dashed; color="#adb5bd";
        z2 [label="z(2)", color="#b2f2bb"];
        a2 [label="a(2)", color="#8ce99a"];
        W2 [label="W(2), b(2)", shape=box, style=filled, color="#e9ecef", fixedsize=true, width=1, height=0.5];
        z2 -> a2 [label=" σ(.)", color="#69db7c"];
    }
    subgraph cluster_loss {
        label = "Loss"; style=dashed; color="#adb5bd";
        L [label="L", color="#ffc9c9", shape=diamond];
    }

    // Forward Pass
    x -> z1 [label=" W(1)x+b(1) ", arrowhead=vee, color="#1c7ed6"];
    a1 -> z2 [label=" W(2)a(1)+b(2) ", arrowhead=vee, color="#1c7ed6"];
    a2 -> L [label=" Loss(a(2), y) ", arrowhead=vee, color="#1c7ed6"];

    // Connections to Parameters
    W1 -> z1 [style=invis];
    W2 -> z2 [style=invis];

    // Backward Pass
    edge [arrowhead=curve, color="#f03e3e", constraint=false];
    L -> a2 [label=" ∂L/∂a(2) "];
    a2 -> z2 [label=" ∂a(2)/∂z(2)=σ'(z(2)) "];
    z2 -> a1 [label=" ∂z(2)/∂a(1)=W(2) "];
    z2 -> W2 [label=" ∂L/∂W(2), ∂L/∂b(2) ", style=dashed];
    a1 -> z1 [label=" ∂a(1)/∂z(1)=σ'(z(1)) "];
    z1 -> x [label=" ∂z(1)/∂x=W(1) ", style=invis];  // Not usually needed for params
    z1 -> W1 [label=" ∂L/∂W(1), ∂L/∂b(1) ", style=dashed];

    {rank=same; x}
    {rank=same; W1; z1; a1}
    {rank=same; W2; z2; a2}
    {rank=same; L}
}
```

A simplified view of forward (blue arrows) and backward (red arrows) passes. The forward pass computes activations and the loss. The backward pass computes gradients, starting from the loss and propagating error signals (like $\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$) backward, reusing values calculated during the forward pass ($a^{(l)}$, $z^{(l)}$) and the network weights ($W^{(l)}$). Gradients for the parameters ($W, b$) are derived from these error signals.

### Why is it Efficient?

The efficiency of backpropagation comes from two main aspects:

- **Reusing calculations:** Intermediate values ($z^{(l)}$, $a^{(l)}$) computed during the forward pass are stored and reused during the backward pass. More importantly, the error signal $\delta^{(l+1)}$ computed for layer $l+1$ is directly used to compute the error signal $\delta^{(l)}$ for layer $l$. This avoids redundant calculations.
- **Dynamic programming:** It is essentially a dynamic programming algorithm. By calculating gradients layer by layer from back to front, it builds up the solution for the entire network's gradients without needing to recompute the influence paths for every single parameter independently.

Compared to numerically estimating gradients, which requires at least one extra forward pass for every single parameter, backpropagation computes all gradients with one forward pass plus one backward pass, and the backward pass costs roughly as much as the forward pass. For a network with millions of parameters, this difference is enormous, making the training of deep networks feasible.
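To see that contrast in code, the short continuation below (reusing the hypothetical `forward_pass`, `weights`, `biases`, `x`, `y`, and `grads_W` from the NumPy sketch in the previous section) estimates the gradient of a single weight by central finite differences. That already costs two extra forward passes for one parameter, while the backward pass above delivered every gradient in one sweep; it also doubles as a handy gradient check.

```python
# Numerical estimate of dL/dW^(1)[0, 0] via central differences: two full
# forward passes for this ONE parameter. Repeating this for every parameter
# of a large network would take ~2 * (number of parameters) forward passes,
# versus one forward + one backward pass for backpropagation.
def loss_value(x, y, weights, biases):
    _, a_vals = forward_pass(x, weights, biases)
    return 0.5 * float(np.sum((a_vals[-1] - y) ** 2))  # same assumed squared-error loss

eps = 1e-6
weights_plus = [W.copy() for W in weights]
weights_minus = [W.copy() for W in weights]
weights_plus[0][0, 0] += eps
weights_minus[0][0, 0] -= eps

numeric = (loss_value(x, y, weights_plus, biases)
           - loss_value(x, y, weights_minus, biases)) / (2 * eps)

print(f"backprop: {grads_W[0][0, 0]:.8f}, numerical: {numeric:.8f}")
```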
With the gradients computed via backpropagation, we now have the direction needed to update the weights and biases using gradient descent, which we will detail in the following sections.