A practical example of backpropagation is presented here. Using a very simple neural network, we manually compute the gradients with the chain rule. This demonstration clarifies the calculations required for training neural networks.

## Setting up a Simple Network

Consider a tiny neural network with one input feature $x$, one hidden layer containing a single neuron, and one output neuron. Both neurons will use the sigmoid activation function, $\sigma(z) = 1 / (1 + e^{-z})$. Our goal is to predict a target value $y_{true}$ given the input $x$. We'll use the Mean Squared Error (MSE) loss function, specifically $L = \frac{1}{2}(y_{pred} - y_{true})^2$, where $y_{pred}$ is the network's output. The factor of $\frac{1}{2}$ is often added to simplify the derivative later.

Here are the components:

- Input: $x$
- Target: $y_{true}$
- Hidden layer:
  - Weight: $w_1$
  - Bias: $b_1$
  - Linear combination: $z_1 = w_1 x + b_1$
  - Activation: $a_1 = \sigma(z_1)$
- Output layer:
  - Weight: $w_2$
  - Bias: $b_2$
  - Linear combination: $z_2 = w_2 a_1 + b_2$
  - Activation (prediction): $y_{pred} = a_2 = \sigma(z_2)$
- Loss function: $L = \frac{1}{2}(a_2 - y_{true})^2$

Our objective is to find the gradients of the loss $L$ with respect to each parameter: $\frac{\partial L}{\partial w_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial w_2}$, and $\frac{\partial L}{\partial b_2}$.
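The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ does most of the work in the backward pass below, so it is worth a quick sanity check. The following sketch (added here for illustration; the helper names are our own) compares the analytic derivative against a centered finite difference:

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^{-z})."""
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

def numeric_derivative(f, z, h=1e-6):
    """Centered finite difference: an independent estimate of the slope."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-2.0, 0.0, 1.1):
    analytic = sigmoid_prime(z)
    numeric = numeric_derivative(sigmoid, z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns agree to many decimal places, which gives us confidence in using the analytic form throughout the derivation.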
These gradients tell us how a small change in each parameter affects the loss, guiding the learning process.

## Visualizing the Computations

We can represent this network and the flow of calculations using a computational graph.

```dot
digraph G {
  rankdir=LR;
  node [shape=circle, style=filled, color="#ced4da"];
  edge [color="#868e96"];
  subgraph cluster_input {
    label="Input"; style=filled; color="#e9ecef";
    x [label="x", shape=plaintext, fontsize=12];
    y_true [label="y_true", shape=plaintext, fontsize=12];
  }
  subgraph cluster_hidden {
    label="Hidden Layer"; style=filled; color="#e9ecef";
    w1 [label="w₁", shape=plaintext, fontsize=12];
    b1 [label="b₁", shape=plaintext, fontsize=12];
    z1 [label="z₁", color="#a5d8ff"];
    a1 [label="a₁", color="#74c0fc"];
    sigma1 [label="σ", shape=diamond, color="#1c7ed6", style=filled, fontcolor="white", fontsize=12];
  }
  subgraph cluster_output {
    label="Output Layer"; style=filled; color="#e9ecef";
    w2 [label="w₂", shape=plaintext, fontsize=12];
    b2 [label="b₂", shape=plaintext, fontsize=12];
    z2 [label="z₂", color="#a5d8ff"];
    a2 [label="a₂ (y_pred)", color="#74c0fc"];
    sigma2 [label="σ", shape=diamond, color="#1c7ed6", style=filled, fontcolor="white", fontsize=12];
  }
  subgraph cluster_loss {
    label="Loss"; style=filled; color="#e9ecef";
    L [label="L", shape=box, color="#ffc9c9"];
    loss_func [label="½(• - y_true)²", shape=plaintext, fontsize=10];
  }
  // Connections
  x -> z1 [label="*"];
  w1 -> z1 [label="*"];
  b1 -> z1 [label="+"];
  z1 -> sigma1;
  sigma1 -> a1;
  a1 -> z2 [label="*"];
  w2 -> z2 [label="*"];
  b2 -> z2 [label="+"];
  z2 -> sigma2;
  sigma2 -> a2;
  a2 -> loss_func;
  y_true -> loss_func;
  loss_func -> L;
}
```

*Computational graph showing the forward pass from input $x$ to loss $L$.*
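The idea behind the graph can be made concrete in a few lines of code: each node stores its value plus, for every input, the local derivative of the operation that produced it, and a `backward()` call pushes gradients along those edges via the chain rule. This is a minimal sketch of the concept (the `Node` class and helper names are invented for illustration, and it assumes a tree-shaped graph like ours; real frameworks are far more sophisticated):

```python
class Node:
    """A scalar in the computational graph, with a gradient slot."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (parent_node, local_derivative)

    def backward(self, upstream=1.0):
        # Chain rule: pass upstream * local derivative to each parent.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def mul(a, b):
    # d(a*b)/da = b, d(a*b)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    # d(a+b)/da = d(a+b)/db = 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

# One piece of the graph above: z1 = w1 * x + b1
x, w1, b1 = Node(2.0), Node(0.5), Node(0.1)
z1 = add(mul(w1, x), b1)
z1.backward()
print(w1.grad, x.grad, b1.grad)  # dz1/dw1 = x = 2.0, dz1/dx = w1 = 0.5, dz1/db1 = 1.0
```

The backward pass over the full graph below does exactly this, just written out by hand.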
The backward pass involves calculating gradients by moving from $L$ back towards the inputs and parameters.

## The Forward Pass: A Numerical Example

Let's assign some specific values:

- Input: $x = 2.0$
- Target: $y_{true} = 1.0$
- Initial weights: $w_1 = 0.5$, $w_2 = -0.3$
- Initial biases: $b_1 = 0.1$, $b_2 = 0.2$

Now, calculate the network's output step-by-step:

1. **Hidden layer input:** $z_1 = w_1 x + b_1 = (0.5)(2.0) + 0.1 = 1.0 + 0.1 = 1.1$
2. **Hidden layer activation:** $a_1 = \sigma(z_1) = \sigma(1.1) = \frac{1}{1 + e^{-1.1}} \approx \frac{1}{1 + 0.3329} \approx 0.7503$
3. **Output layer input:** $z_2 = w_2 a_1 + b_2 = (-0.3)(0.7503) + 0.2 = -0.2251 + 0.2 = -0.0251$
4. **Output layer activation (prediction):** $a_2 = \sigma(z_2) = \sigma(-0.0251) = \frac{1}{1 + e^{0.0251}} \approx \frac{1}{1 + 1.0254} \approx 0.4937$. So, $y_{pred} = 0.4937$.
5. **Loss:** $L = \frac{1}{2}(a_2 - y_{true})^2 = \frac{1}{2}(0.4937 - 1.0)^2 = \frac{1}{2}(-0.5063)^2 \approx \frac{1}{2}(0.2563) \approx 0.1282$

The initial prediction is $0.4937$, quite far from the target $1.0$, resulting in a loss of $0.1282$. Now we need the gradients to update the weights and biases to reduce this loss.

## The Backward Pass: Applying the Chain Rule

We compute gradients starting from the loss and moving backward through the network. Remember the derivative of the sigmoid function: $\sigma'(z) = \frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$.

1. **Gradient of Loss w.r.t. Output Activation ($a_2$):** $\frac{\partial L}{\partial a_2} = \frac{\partial}{\partial a_2} \left[ \frac{1}{2}(a_2 - y_{true})^2 \right] = (a_2 - y_{true})$. Using our values: $\frac{\partial L}{\partial a_2} = 0.4937 - 1.0 = -0.5063$
2. **Gradient of Loss w.r.t. Output Layer Input ($z_2$):** Use the chain rule: $\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial a_2} \frac{\partial a_2}{\partial z_2}$. We need $\frac{\partial a_2}{\partial z_2} = \sigma'(z_2) = \sigma(z_2)(1 - \sigma(z_2)) = a_2 (1 - a_2)$: $\sigma'(z_2) \approx 0.4937 (1 - 0.4937) \approx 0.4937 \times 0.5063 \approx 0.2500$. So, $\frac{\partial L}{\partial z_2} = (-0.5063) \times (0.2500) \approx -0.1266$
3. **Gradient of Loss w.r.t. Output Weight ($w_2$):** Use the chain rule again: $\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial w_2}$. We know $z_2 = w_2 a_1 + b_2$, so $\frac{\partial z_2}{\partial w_2} = a_1$. $\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z_2} \times a_1 \approx (-0.1266) \times (0.7503) \approx -0.0950$
4. **Gradient of Loss w.r.t. Output Bias ($b_2$):** $\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial b_2}$. Since $z_2 = w_2 a_1 + b_2$, $\frac{\partial z_2}{\partial b_2} = 1$. $\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2} \times 1 \approx -0.1266$
5. **Gradient of Loss w.r.t. Hidden Activation ($a_1$):** This gradient is needed to propagate further back. $\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial a_1}$. Since $z_2 = w_2 a_1 + b_2$, $\frac{\partial z_2}{\partial a_1} = w_2$. $\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} \times w_2 \approx (-0.1266) \times (-0.3) \approx 0.0380$
6. **Gradient of Loss w.r.t. Hidden Layer Input ($z_1$):** $\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \frac{\partial a_1}{\partial z_1}$. We need $\frac{\partial a_1}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1 - \sigma(z_1)) = a_1 (1 - a_1)$: $\sigma'(z_1) \approx 0.7503 (1 - 0.7503) \approx 0.7503 \times 0.2497 \approx 0.1874$. So, $\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \times \sigma'(z_1) \approx (0.0380) \times (0.1874) \approx 0.0071$
7. **Gradient of Loss w.r.t. Hidden Weight ($w_1$):** $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z_1} \frac{\partial z_1}{\partial w_1}$. Since $z_1 = w_1 x + b_1$, $\frac{\partial z_1}{\partial w_1} = x$. $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z_1} \times x \approx (0.0071) \times (2.0) \approx 0.0142$
8. **Gradient of Loss w.r.t. Hidden Bias ($b_1$):** $\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \frac{\partial z_1}{\partial b_1}$. Since $z_1 = w_1 x + b_1$, $\frac{\partial z_1}{\partial b_1} = 1$. $\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \times 1 \approx 0.0071$

We have now computed all the required gradients:

- $\frac{\partial L}{\partial w_2} \approx -0.0950$
- $\frac{\partial L}{\partial b_2} \approx -0.1266$
- $\frac{\partial L}{\partial w_1} \approx 0.0142$
- $\frac{\partial L}{\partial b_1} \approx 0.0071$

These gradients are the core output of the backpropagation step for this single data point.

## Connection to Gradient Descent

These calculated gradients are exactly what we need for the update step in gradient descent (or its variants like SGD or mini-batch GD).
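Before plugging these numbers into an update rule, note that hand-derived (or hand-coded) gradients can always be cross-checked numerically: nudge one parameter by a small $\epsilon$, rerun the forward pass, and measure how the loss changes. The sketch below (written for this example, not part of the original derivation) recovers gradients that agree with the values above to within the rounding of our four-decimal intermediates:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward_loss(w1, b1, w2, b2, x=2.0, y_true=1.0):
    """Run the full forward pass and return L = 0.5 * (a2 - y_true)^2."""
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return 0.5 * (a2 - y_true) ** 2

params = [0.5, 0.1, -0.3, 0.2]  # initial w1, b1, w2, b2
eps = 1e-6
for i, name in enumerate(["dL/dw1", "dL/db1", "dL/dw2", "dL/db2"]):
    plus, minus = params.copy(), params.copy()
    plus[i] += eps
    minus[i] -= eps
    # Centered difference approximates the partial derivative of L.
    grad = (forward_loss(*plus) - forward_loss(*minus)) / (2 * eps)
    print(f"{name}: {grad:.4f}")
```

This kind of finite-difference check is a standard debugging tool for backpropagation code, since the numerical estimate requires no calculus at all.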
For example, using standard gradient descent, the update rules would look like this (where $\eta$ is the learning rate):

$w_1 \leftarrow w_1 - \eta \frac{\partial L}{\partial w_1}$
$b_1 \leftarrow b_1 - \eta \frac{\partial L}{\partial b_1}$
$w_2 \leftarrow w_2 - \eta \frac{\partial L}{\partial w_2}$
$b_2 \leftarrow b_2 - \eta \frac{\partial L}{\partial b_2}$

Plugging in our calculated gradients (assuming $\eta = 0.1$ for illustration):

$w_1 \leftarrow 0.5 - 0.1 \times (0.0142) = 0.5 - 0.00142 = 0.49858$
$b_1 \leftarrow 0.1 - 0.1 \times (0.0071) = 0.1 - 0.00071 = 0.09929$
$w_2 \leftarrow -0.3 - 0.1 \times (-0.0950) = -0.3 + 0.00950 = -0.29050$
$b_2 \leftarrow 0.2 - 0.1 \times (-0.1266) = 0.2 + 0.01266 = 0.21266$

After just one update step, the parameters have been slightly adjusted in directions that are expected to reduce the loss for this specific input-output pair. Repeating this process (forward pass, backward pass, parameter update) over many data points and iterations allows the network to learn complex patterns.

## Python Implementation Snippet

Let's verify these calculations with a short Python snippet using basic math functions.

```python
import math

# Sigmoid function and its derivative
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(z):
    # Calculate using the output of sigmoid: sigma(z) * (1 - sigma(z))
    s_z = sigmoid(z)
    return s_z * (1 - s_z)
    # Alternatively: return math.exp(-z) / ((1 + math.exp(-z))**2)

# Input and Target
x = 2.0
y_true = 1.0

# Initial Parameters
w1 = 0.5
b1 = 0.1
w2 = -0.3
b2 = 0.2

# --- Forward Pass ---
# Hidden Layer
z1 = w1 * x + b1
a1 = sigmoid(z1)
# Output Layer
z2 = w2 * a1 + b2
a2 = sigmoid(z2)  # y_pred
# Loss
loss = 0.5 * (a2 - y_true)**2

print("--- Forward Pass ---")
print(f"z1: {z1:.4f}")
print(f"a1: {a1:.4f}")
print(f"z2: {z2:.4f}")
print(f"a2 (y_pred): {a2:.4f}")
print(f"Loss: {loss:.4f}\n")

# --- Backward Pass ---
# Gradients for Output Layer
dL_da2 = a2 - y_true
da2_dz2 = sigmoid_prime(z2)  # Or use a2 * (1 - a2)
dL_dz2 = dL_da2 * da2_dz2
dL_dw2 = dL_dz2 * a1
dL_db2 = dL_dz2 * 1.0

# Gradients for Hidden Layer
dz2_da1 = w2
dL_da1 = dL_dz2 * dz2_da1
da1_dz1 = sigmoid_prime(z1)  # Or use a1 * (1 - a1)
dL_dz1 = dL_da1 * da1_dz1
dL_dw1 = dL_dz1 * x
dL_db1 = dL_dz1 * 1.0

print("--- Backward Pass (Gradients) ---")
print(f"dL/da2: {dL_da2:.4f}")
print(f"dL/dz2: {dL_dz2:.4f}")
print(f"dL/dw2: {dL_dw2:.4f}")
print(f"dL/db2: {dL_db2:.4f}")
print(f"dL/da1: {dL_da1:.4f}")
print(f"dL/dz1: {dL_dz1:.4f}")
print(f"dL/dw1: {dL_dw1:.4f}")
print(f"dL/db1: {dL_db1:.4f}")

# --- Parameter Update Example (Learning Rate = 0.1) ---
lr = 0.1
w1_new = w1 - lr * dL_dw1
b1_new = b1 - lr * dL_db1
w2_new = w2 - lr * dL_dw2
b2_new = b2 - lr * dL_db2

print(f"\n--- Parameter Update (lr = {lr}) ---")
print(f"New w1: {w1_new:.5f}")
print(f"New b1: {b1_new:.5f}")
print(f"New w2: {w2_new:.5f}")
print(f"New b2: {b2_new:.5f}")
```

Running this code produces results that match our manual calculations to within rounding (the hand calculations carried only four decimal places, so a few digits differ by one in the last place):

```
--- Forward Pass ---
z1: 1.1000
a1: 0.7503
z2: -0.0251
a2 (y_pred): 0.4937
Loss: 0.1282

--- Backward Pass (Gradients) ---
dL/da2: -0.5063
dL/dz2: -0.1265
dL/dw2: -0.0949
dL/db2: -0.1265
dL/da1: 0.0380
dL/dz1: 0.0071
dL/dw1: 0.0142
dL/db1: 0.0071

--- Parameter Update (lr = 0.1) ---
New w1: 0.49858
New b1: 0.09929
New w2: -0.29051
New b2: 0.21265
```

This example demonstrates the mechanical process of backpropagation. It is essentially a systematic application of the chain rule to compute how sensitive the final loss is to each parameter in the network, propagating these sensitivities backward layer by layer. While deep learning frameworks automate this, understanding the underlying calculus is fundamental to grasping how neural networks learn.
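As a closing illustration of the "forward pass, backward pass, parameter update" loop described above, the sketch below (an addition for illustration, still using our single training example, but with a larger learning rate of $\eta = 0.5$ so progress is visible in few steps) repeats the whole procedure and tracks the loss:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Same single training example and initial parameters as above.
x, y_true = 2.0, 1.0
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
lr = 0.5  # larger than the 0.1 used above, purely for faster visible progress

losses = []
for step in range(200):
    # Forward pass
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    losses.append(0.5 * (a2 - y_true) ** 2)

    # Backward pass (same chain-rule steps as the worked example)
    dL_dz2 = (a2 - y_true) * a2 * (1 - a2)
    dL_dw2 = dL_dz2 * a1
    dL_db2 = dL_dz2
    dL_dz1 = dL_dz2 * w2 * a1 * (1 - a1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # Gradient descent update
    w1 -= lr * dL_dw1
    b1 -= lr * dL_db1
    w2 -= lr * dL_dw2
    b2 -= lr * dL_db2

print(f"initial loss:          {losses[0]:.4f}")
print(f"loss after 199 updates: {losses[-1]:.4f}")
```

The loss should fall from roughly 0.128 toward zero over the iterations; in real training, updates are of course computed over many different examples rather than one repeated pair.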