Training a neural network involves adjusting its weights and biases to minimize a loss function. Backpropagation provides an efficient way to compute the gradient of the loss function with respect to every parameter in the network, which is exactly what gradient descent needs to perform these adjustments. A concrete example will illustrate the backpropagation algorithm step-by-step.We'll use a very simple network: one input neuron, a single hidden layer with two neurons, and one output neuron. This is small enough to track all calculations manually.1. Network Setup and Forward PassLet's define our simple network:Input: $x = 0.5$Hidden Layer (2 neurons): Sigmoid activation function.Output Layer (1 neuron): Sigmoid activation function.Weights & Biases (initialized randomly):Input to Hidden: $w_{1} = 0.1$, $w_{2} = 0.2$Hidden Biases: $b_{h1} = 0.1$, $b_{h2} = 0.1$Hidden to Output: $w_{3} = 0.3$, $w_{4} = 0.4$Output Bias: $b_{o} = 0.1$Target Output: $y = 1.0$Loss Function: Mean Squared Error (MSE): $L = \frac{1}{2}(\hat{y} - y)^2$. (We use $\frac{1}{2}$ for convenience, as its derivative cancels out the exponent 2).Activation Function: Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$. Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.Let's perform the forward pass calculation step-by-step:a. Input to Hidden Layer:Weighted sum for hidden neuron 1: $z_{h1} = w_1 \cdot x + b_{h1} = (0.1 \cdot 0.5) + 0.1 = 0.05 + 0.1 = 0.15$Activation for hidden neuron 1: $h_1 = \sigma(z_{h1}) = \sigma(0.15) \approx 0.5374$Weighted sum for hidden neuron 2: $z_{h2} = w_2 \cdot x + b_{h2} = (0.2 \cdot 0.5) + 0.1 = 0.1 + 0.1 = 0.20$Activation for hidden neuron 2: $h_2 = \sigma(z_{h2}) = \sigma(0.20) \approx 0.5498$b. Hidden Layer to Output Layer:Weighted sum for output neuron: $z_{o} = w_3 \cdot h_1 + w_4 \cdot h_2 + b_{o} = (0.3 \cdot 0.5374) + (0.4 \cdot 0.5498) + 0.1$ $z_{o} \approx 0.1612 + 0.2199 + 0.1 = 0.4811$Final prediction (output activation): $\hat{y} = \sigma(z_{o}) = \sigma(0.4811) \approx 0.6180$c. Calculate Loss:Loss: $L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.6180 - 1.0)^2 = \frac{1}{2}(-0.3820)^2 \approx \frac{1}{2}(0.1459) \approx 0.0730$Our network currently predicts $0.6180$, which is quite far from the target $1.0$, resulting in a loss of $0.0730$. Now, let's use backpropagation to find out how to adjust the weights and biases to reduce this loss.2. Backward Pass: Calculating GradientsThe core idea is to use the chain rule from calculus to calculate $\frac{\partial L}{\partial w}$ for every weight $w$ (and similarly for biases $b$). We start from the end (the loss) and work backward.a. Gradients for Output Layer Weights ($w_3, w_4$) and Bias ($b_o$)We need $\frac{\partial L}{\partial w_3}$, $\frac{\partial L}{\partial w_4}$, and $\frac{\partial L}{\partial b_o}$. Let's break down $\frac{\partial L}{\partial w_3}$ using the chain rule:$$ \frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial w_3} $$Let's calculate each part:$\frac{\partial L}{\partial \hat{y}}$: Derivative of the loss w.r.t. the prediction. $L = \frac{1}{2}(\hat{y} - y)^2 \implies \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y) = 0.6180 - 1.0 = -0.3820$$\frac{\partial \hat{y}}{\partial z_o}$: Derivative of the output activation function w.r.t. its input ($z_o$). Since $\hat{y} = \sigma(z_o)$, the derivative is $\sigma'(z_o) = \sigma(z_o)(1 - \sigma(z_o))$. $\frac{\partial \hat{y}}{\partial z_o} = \hat{y}(1 - \hat{y}) = 0.6180 \cdot (1 - 0.6180) = 0.6180 \cdot 0.3820 \approx 0.2361$$\frac{\partial z_o}{\partial w_3}$: Derivative of the output weighted sum w.r.t. $w_3$. $z_{o} = w_3 h_1 + w_4 h_2 + b_{o} \implies \frac{\partial z_o}{\partial w_3} = h_1 \approx 0.5374$Now, combine them: $$ \frac{\partial L}{\partial w_3} = (-0.3820) \cdot (0.2361) \cdot (0.5374) \approx -0.0485 $$Similarly for $w_4$: The only difference is the last term in the chain rule. $$ \frac{\partial z_o}{\partial w_4} = h_2 \approx 0.5498 $$ $$ \frac{\partial L}{\partial w_4} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial w_4} = (-0.3820) \cdot (0.2361) \cdot (0.5498) \approx -0.0496 $$And for the output bias $b_o$: $$ \frac{\partial z_o}{\partial b_o} = 1 $$ $$ \frac{\partial L}{\partial b_o} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial b_o} = (-0.3820) \cdot (0.2361) \cdot (1) \approx -0.0902 $$Often, the first two terms are combined into a single "delta" term for the output layer: $\delta_o = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \approx -0.0902$. Then the gradients are simply $\frac{\partial L}{\partial w_3} = \delta_o \cdot h_1$, $\frac{\partial L}{\partial w_4} = \delta_o \cdot h_2$, and $\frac{\partial L}{\partial b_o} = \delta_o$.b. Gradients for Hidden Layer Weights ($w_1, w_2$) and Biases ($b_{h1}, b_{h2}$)Now we need to propagate the error backward to the hidden layer. We need $\frac{\partial L}{\partial w_1}$, $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial b_{h1}}$, and $\frac{\partial L}{\partial b_{h2}}$. Let's focus on $w_1$.The chain rule path is longer: $$ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}} \cdot \frac{\partial z_{h1}}{\partial w_1} $$ We can reuse $\delta_o = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \approx -0.0902$.Let's calculate the new parts:$\frac{\partial z_o}{\partial h_1}$: Derivative of the output weighted sum w.r.t. the activation of the first hidden neuron. $z_{o} = w_3 h_1 + w_4 h_2 + b_{o} \implies \frac{\partial z_o}{\partial h_1} = w_3 = 0.3$$\frac{\partial h_1}{\partial z_{h1}}$: Derivative of the first hidden neuron's activation w.r.t. its input. $h_1 = \sigma(z_{h1})$, so the derivative is $\sigma'(z_{h1}) = h_1(1 - h_1)$. $\frac{\partial h_1}{\partial z_{h1}} = 0.5374 \cdot (1 - 0.5374) = 0.5374 \cdot 0.4626 \approx 0.2486$$\frac{\partial z_{h1}}{\partial w_1}$: Derivative of the first hidden neuron's weighted sum w.r.t. $w_1$. $z_{h1} = w_1 x + b_{h1} \implies \frac{\partial z_{h1}}{\partial w_1} = x = 0.5$Now, combine everything for $\frac{\partial L}{\partial w_1}$: $$ \frac{\partial L}{\partial w_1} = (\delta_o) \cdot (\frac{\partial z_o}{\partial h_1}) \cdot (\frac{\partial h_1}{\partial z_{h1}}) \cdot (\frac{\partial z_{h1}}{\partial w_1}) $$ $$ \frac{\partial L}{\partial w_1} = (-0.0902) \cdot (0.3) \cdot (0.2486) \cdot (0.5) \approx -0.00337 $$Similarly for $w_2$: The error contribution flows through $h_2$.$$ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}} \cdot \frac{\partial z_{h2}}{\partial w_2} $$$\frac{\partial z_o}{\partial h_2} = w_4 = 0.4$$\frac{\partial h_2}{\partial z_{h2}} = h_2(1 - h_2) = 0.5498 \cdot (1 - 0.5498) = 0.5498 \cdot 0.4502 \approx 0.2475$$\frac{\partial z_{h2}}{\partial w_2} = x = 0.5$Combine them: $$ \frac{\partial L}{\partial w_2} = (\delta_o) \cdot (\frac{\partial z_o}{\partial h_2}) \cdot (\frac{\partial h_2}{\partial z_{h2}}) \cdot (\frac{\partial z_{h2}}{\partial w_2}) $$ $$ \frac{\partial L}{\partial w_2} = (-0.0902) \cdot (0.4) \cdot (0.2475) \cdot (0.5) \approx -0.00446 $$For the hidden biases $b_{h1}$ and $b_{h2}$: The last part of the chain rule changes to $\frac{\partial z_{h1}}{\partial b_{h1}} = 1$ and $\frac{\partial z_{h2}}{\partial b_{h2}} = 1$.$$ \frac{\partial L}{\partial b_{h1}} = (\delta_o) \cdot (\frac{\partial z_o}{\partial h_1}) \cdot (\frac{\partial h_1}{\partial z_{h1}}) \cdot (1) $$ $$ \frac{\partial L}{\partial b_{h1}} = (-0.0902) \cdot (0.3) \cdot (0.2486) \cdot (1) \approx -0.00673 $$ $$ \frac{\partial L}{\partial b_{h2}} = (\delta_o) \cdot (\frac{\partial z_o}{\partial h_2}) \cdot (\frac{\partial h_2}{\partial z_{h2}}) \cdot (1) $$ $$ \frac{\partial L}{\partial b_{h2}} = (-0.0902) \cdot (0.4) \cdot (0.2475) \cdot (1) \approx -0.00891 $$We can define delta terms for the hidden layer as well: $\delta_{h1} = \delta_o \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}}$ and $\delta_{h2} = \delta_o \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}}$. Then the gradients are $\frac{\partial L}{\partial w_1} = \delta_{h1} \cdot x$, $\frac{\partial L}{\partial w_2} = \delta_{h2} \cdot x$, $\frac{\partial L}{\partial b_{h1}} = \delta_{h1}$, $\frac{\partial L}{\partial b_{h2}} = \delta_{h2}$.3. Visualizing the FlowWe can represent these calculations using a computational graph. Nodes represent values (input, weights, intermediate results, loss), and edges represent operations. Backpropagation involves calculating the gradient at each node by summing the gradients flowing in from its outgoing edges.digraph G { rankdir=LR; node [shape=circle, style=filled, fillcolor="#e9ecef", margin=0.1]; edge [arrowhead=vee]; subgraph cluster_input { label = "Input Layer"; style=dashed; color="#adb5bd"; x [label="x=0.5", fillcolor="#a5d8ff"]; } subgraph cluster_hidden { label = "Hidden Layer"; style=dashed; color="#adb5bd"; z_h1 [label=<z<sub>h1</sub>=0.15>, fillcolor="#ffe066"]; h1 [label=<h<sub>1</sub>=0.537>, fillcolor="#ffd43b"]; z_h2 [label=<z<sub>h2</sub>=0.20>, fillcolor="#ffe066"]; h2 [label=<h<sub>2</sub>=0.550>, fillcolor="#ffd43b"]; } subgraph cluster_output { label = "Output Layer"; style=dashed; color="#adb5bd"; z_o [label=<z<sub>o</sub>=0.481>, fillcolor="#ffc9c9"]; y_hat [label=<ŷ=0.618>, fillcolor="#ffa8a8"]; L [label="L=0.073", fillcolor="#ff8787", shape=box]; } # Weights and Biases (as edge labels or separate nodes if complex) w1 [label="w1=0.1", shape=plaintext, fontcolor="#495057"]; w2 [label="w2=0.2", shape=plaintext, fontcolor="#495057"]; w3 [label="w3=0.3", shape=plaintext, fontcolor="#495057"]; w4 [label="w4=0.4", shape=plaintext, fontcolor="#495057"]; b_h1 [label="bh1=0.1", shape=plaintext, fontcolor="#495057"]; b_h2 [label="bh2=0.1", shape=plaintext, fontcolor="#495057"]; b_o [label="bo=0.1", shape=plaintext, fontcolor="#495057"]; y [label="y=1.0", shape=plaintext, fontcolor="#495057"]; # Forward Pass Edges x -> z_h1 [label=< <font color="#7048e8">w<sub>1</sub>=0.1</font> >, arrowhead=vee, color="#748ffc"]; x -> z_h2 [label=< <font color="#7048e8">w<sub>2</sub>=0.2</font> >, arrowhead=vee, color="#748ffc"]; z_h1 -> h1 [label="σ", arrowhead=vee, color="#f76707"]; z_h2 -> h2 [label="σ", arrowhead=vee, color="#f76707"]; h1 -> z_o [label=< <font color="#7048e8">w<sub>3</sub>=0.3</font> >, arrowhead=vee, color="#748ffc"]; h2 -> z_o [label=< <font color="#7048e8">w<sub>4</sub>=0.4</font> >, arrowhead=vee, color="#748ffc"]; z_o -> y_hat [label="σ", arrowhead=vee, color="#f76707"]; y_hat -> L [label="MSE", arrowhead=vee, color="#f03e3e"]; y -> L [style=dotted, color="#f03e3e"]; // Target influences loss # Indicate Biases (added at z nodes) b_h1 -> z_h1 [style=dotted, color="#adb5bd", arrowhead=none]; b_h2 -> z_h2 [style=dotted, color="#adb5bd", arrowhead=none]; b_o -> z_o [style=dotted, color="#adb5bd", arrowhead=none]; # Backward Pass (- Gradients flow opposite to forward arrows) # Example gradient flow (not drawn explicitly to avoid clutter) # L -> y_hat [label="∂L/∂ŷ", dir=back, color="#12b886", style=dashed]; # y_hat -> z_o [label="∂ŷ/∂zo", dir=back, color="#12b886", style=dashed]; # z_o -> h1 [label="∂zo/∂h1", dir=back, color="#12b886", style=dashed]; # ... etc ... }A computational graph showing the forward pass calculations for our simple network. Backpropagation computes gradients by traversing this graph backward from the loss (L). Solid arrows indicate the forward computation flow; gradients flow in the reverse direction along these paths during backpropagation.4. Weight UpdateWith the gradients calculated, we can update the weights and biases using the gradient descent rule. Let's assume a learning rate $\eta = 0.1$.$w_1^{new} = w_1 - \eta \frac{\partial L}{\partial w_1} = 0.1 - (0.1 \cdot -0.00337) \approx 0.1003$$w_2^{new} = w_2 - \eta \frac{\partial L}{\partial w_2} = 0.2 - (0.1 \cdot -0.00446) \approx 0.2004$$b_{h1}^{new} = b_{h1} - \eta \frac{\partial L}{\partial b_{h1}} = 0.1 - (0.1 \cdot -0.00673) \approx 0.1007$$b_{h2}^{new} = b_{h2} - \eta \frac{\partial L}{\partial b_{h2}} = 0.1 - (0.1 \cdot -0.00891) \approx 0.1009$$w_3^{new} = w_3 - \eta \frac{\partial L}{\partial w_3} = 0.3 - (0.1 \cdot -0.0485) \approx 0.3049$$w_4^{new} = w_4 - \eta \frac{\partial L}{\partial w_4} = 0.4 - (0.1 \cdot -0.0496) \approx 0.4050$$b_o^{new} = b_o - \eta \frac{\partial L}{\partial b_o} = 0.1 - (0.1 \cdot -0.0902) \approx 0.1090$After just one step, the parameters have been slightly adjusted in directions that should reduce the loss on the next forward pass with the same input.5. Backpropagation in PyTorchManually calculating these gradients is instructive but quickly becomes impractical for larger networks. Deep learning frameworks like PyTorch automate this process using automatic differentiation (autograd). Here's how you might set up the same calculation:import torch # --- Setup --- # Use FloatTensor for calculations dtype = torch.float # Use CPU for this simple example device = torch.device("cpu") # Define tensors for input, target, and parameters # requires_grad=True tells PyTorch to track operations for gradient calculation x = torch.tensor([[0.5]], device=device, dtype=dtype) y = torch.tensor([[1.0]], device=device, dtype=dtype) w1 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True) w2 = torch.tensor([[0.2]], device=device, dtype=dtype, requires_grad=True) bh1 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True) bh2 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True) # Combine hidden weights and biases for matrix multiplication later # For this simple example, we'll keep them separate like manual calc w3 = torch.tensor([[0.3]], device=device, dtype=dtype, requires_grad=True) w4 = torch.tensor([[0.4]], device=device, dtype=dtype, requires_grad=True) bo = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True) # --- Forward Pass --- z_h1 = x @ w1.t() + bh1 # Using @ for matrix multiplication (though it's 1x1 here) h1 = torch.sigmoid(z_h1) z_h2 = x @ w2.t() + bh2 h2 = torch.sigmoid(z_h2) # Combine hidden activations (forming a hidden layer output vector) # h_combined = torch.cat((h1, h2), dim=1) # Would need w_ho as a matrix [2,1] # For direct comparison, calculate z_o manually z_o = h1 @ w3.t() + h2 @ w4.t() + bo y_hat = torch.sigmoid(z_o) # --- Loss Calculation --- # Using built-in MSELoss is better, but manual calc for comparison loss = 0.5 * (y_hat - y).pow(2) loss = loss.sum() # Usually average loss over batch, here just sum scalar loss print(f"Input x: {x.item():.4f}") print(f"Target y: {y.item():.4f}") print(f"h1: {h1.item():.4f}, h2: {h2.item():.4f}") print(f"Prediction y_hat: {y_hat.item():.4f}") print(f"Loss: {loss.item():.4f}") # --- Backward Pass (Autograd) --- # Computes gradients for all tensors with requires_grad=True loss.backward() # --- Check Gradients --- # Gradients are stored in the .grad attribute of the tensors print("\n--- Gradients ---") print(f"dL/dw1: {w1.grad.item():.5f} (Manual: -0.00337)") print(f"dL/dw2: {w2.grad.item():.5f} (Manual: -0.00446)") print(f"dL/dbh1: {bh1.grad.item():.5f} (Manual: -0.00673)") print(f"dL/dbh2: {bh2.grad.item():.5f} (Manual: -0.00891)") print(f"dL/dw3: {w3.grad.item():.5f} (Manual: -0.04850)") # Small diff due to precision print(f"dL/dw4: {w4.grad.item():.5f} (Manual: -0.04957)") # Small diff due to precision print(f"dL/dbo: {bo.grad.item():.5f} (Manual: -0.09019)") # Small diff due to precision # Note: Manual calculations used fewer decimal places, leading to tiny discrepancies. # PyTorch uses higher precision.This practical walk-through demonstrates how backpropagation meticulously applies the chain rule to find the contribution of each weight and bias to the overall loss. While frameworks handle the implementation, understanding this flow is fundamental to diagnosing training issues and designing effective network architectures. This calculated gradient information is then fed into optimization algorithms like SGD, Adam, or RMSprop (discussed next) to iteratively improve the model.