Alright, let's solidify our understanding of backpropagation by walking through a concrete example. We'll use a very simple neural network and manually compute the gradients using the chain rule, just like the backpropagation algorithm does. This hands-on approach helps clarify how the abstract concepts connect to actual calculations needed for training.
Consider a tiny neural network with one input feature x, one hidden layer containing a single neuron, and one output neuron. Both neurons use the sigmoid activation function, σ(z) = 1/(1 + e^(−z)). Our goal is to predict a target value y_true given the input x. We'll use the Mean Squared Error (MSE) loss function, specifically L = ½(y_pred − y_true)², where y_pred is the network's output. The factor of ½ is often added to simplify the derivative later.
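The payoff of that ½ is that the derivative of the loss with respect to the prediction is simply (y_pred − y_true), with no leftover factor of 2. A quick numerical sketch confirming this (the values here are arbitrary illustrative choices, not the example's):

```python
y_pred, y_true = 0.7, 1.0   # arbitrary illustrative values
analytic = y_pred - y_true  # d/dy_pred of 0.5 * (y_pred - y_true)**2

# Central finite difference of the loss as a check
eps = 1e-6
numeric = (0.5 * (y_pred + eps - y_true) ** 2
           - 0.5 * (y_pred - eps - y_true) ** 2) / (2 * eps)
print(analytic, numeric)  # the two values agree
```

Without the ½, both numbers would come out twice as large, and the factor of 2 would have to be carried through every gradient below.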
Here are the components:

- Input: x
- Hidden layer: z1 = w1·x + b1, a1 = σ(z1), with weight w1 and bias b1
- Output layer: z2 = w2·a1 + b2, a2 = σ(z2) = y_pred, with weight w2 and bias b2
- Loss: L = ½(a2 − y_true)²
Our objective is to find the gradients of the loss L with respect to each parameter: ∂L/∂w1, ∂L/∂b1, ∂L/∂w2, and ∂L/∂b2. These gradients tell us how a small change in each parameter affects the loss, guiding the learning process.
We can represent this network and the flow of calculations using a computational graph.
Computational graph showing the forward pass from input x to loss L. The backward pass involves calculating gradients by moving from L back towards the inputs and parameters.
Let's assign some specific values:

- Input: x = 2.0, target: y_true = 1.0
- Hidden layer: w1 = 0.5, b1 = 0.1
- Output layer: w2 = −0.3, b2 = 0.2
Now, calculate the network's output step-by-step:

z1 = w1·x + b1 = 0.5 × 2.0 + 0.1 = 1.1
a1 = σ(z1) = σ(1.1) ≈ 0.7503
z2 = w2·a1 + b2 = −0.3 × 0.7503 + 0.2 ≈ −0.0251
a2 = σ(z2) = σ(−0.0251) ≈ 0.4937 (this is y_pred)
L = ½(a2 − y_true)² = ½(0.4937 − 1.0)² ≈ 0.1282
The initial prediction is 0.4937, quite far from the target 1.0, resulting in a loss of 0.1282. Now we need the gradients to update the weights and biases to reduce this loss.
We compute gradients starting from the loss and moving backward through the network. Remember the derivative of the sigmoid function: σ′(z) = dσ(z)/dz = σ(z)(1 − σ(z)).
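This identity is easy to sanity-check numerically. A minimal sketch comparing the closed form against a central finite difference at z = 1.1 (the value z1 takes in this example):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, eps = 1.1, 1e-6
identity = sigmoid(z) * (1 - sigmoid(z))                      # sigma(z)(1 - sigma(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite difference
print(f"{identity:.6f} vs {numeric:.6f}")                     # both ~0.1874
```

The two values agree to many decimal places, which is why frameworks compute σ′ directly from the stored activation rather than differentiating numerically.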
Gradient of Loss w.r.t. Output Activation (a2):
∂L/∂a2 = ∂/∂a2 [½(a2 − y_true)²] = (a2 − y_true)
Using our values: ∂L/∂a2 = 0.4937 − 1.0 = −0.5063
Gradient of Loss w.r.t. Output Layer Input (z2):
Use the chain rule: ∂L/∂z2 = (∂L/∂a2)(∂a2/∂z2)
We need ∂a2/∂z2 = σ′(z2) = σ(z2)(1 − σ(z2)) = a2(1 − a2)
σ′(z2) ≈ 0.4937 × (1 − 0.4937) ≈ 0.4937 × 0.5063 ≈ 0.2500
So, ∂L/∂z2 = (−0.5063) × (0.2500) ≈ −0.1266
Gradient of Loss w.r.t. Output Weight (w2):
Use the chain rule again: ∂L/∂w2 = (∂L/∂z2)(∂z2/∂w2)
We know z2 = w2·a1 + b2, so ∂z2/∂w2 = a1.
∂L/∂w2 = ∂L/∂z2 × a1 ≈ (−0.1266) × (0.7503) ≈ −0.0950
Gradient of Loss w.r.t. Output Bias (b2):
∂L/∂b2 = (∂L/∂z2)(∂z2/∂b2)
Since z2 = w2·a1 + b2, ∂z2/∂b2 = 1.
∂L/∂b2 = ∂L/∂z2 × 1 ≈ −0.1266
Gradient of Loss w.r.t. Hidden Activation (a1):
This gradient is needed to propagate further back.
∂L/∂a1 = (∂L/∂z2)(∂z2/∂a1)
Since z2 = w2·a1 + b2, ∂z2/∂a1 = w2.
∂L/∂a1 = ∂L/∂z2 × w2 ≈ (−0.1266) × (−0.3) ≈ 0.0380
Gradient of Loss w.r.t. Hidden Layer Input (z1):
∂L/∂z1 = (∂L/∂a1)(∂a1/∂z1)
We need ∂a1/∂z1 = σ′(z1) = σ(z1)(1 − σ(z1)) = a1(1 − a1)
σ′(z1) ≈ 0.7503 × (1 − 0.7503) ≈ 0.7503 × 0.2497 ≈ 0.1874
So, ∂L/∂z1 = ∂L/∂a1 × σ′(z1) ≈ (0.0380) × (0.1874) ≈ 0.0071
Gradient of Loss w.r.t. Hidden Weight (w1):
∂L/∂w1 = (∂L/∂z1)(∂z1/∂w1)
Since z1 = w1·x + b1, ∂z1/∂w1 = x.
∂L/∂w1 = ∂L/∂z1 × x ≈ (0.0071) × (2.0) ≈ 0.0142
Gradient of Loss w.r.t. Hidden Bias (b1):
∂L/∂b1 = (∂L/∂z1)(∂z1/∂b1)
Since z1 = w1·x + b1, ∂z1/∂b1 = 1.
∂L/∂b1 = ∂L/∂z1 × 1 ≈ 0.0071
We have now computed all the required gradients:

- ∂L/∂w1 ≈ 0.0142
- ∂L/∂b1 ≈ 0.0071
- ∂L/∂w2 ≈ −0.0950
- ∂L/∂b2 ≈ −0.1266
These gradients are the core output of the backpropagation step for this single data point.
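A standard way to double-check a backprop derivation like this is a numerical gradient check: nudge each parameter by a small ε, re-run the forward pass, and compare the resulting loss change against the analytic gradient. A sketch for this network (the helper name loss_fn is invented for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_fn(w1, b1, w2, b2, x=2.0, y_true=1.0):
    # The full forward pass folded into one function of the parameters
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return 0.5 * (a2 - y_true) ** 2

params = [0.5, 0.1, -0.3, 0.2]  # w1, b1, w2, b2
eps = 1e-6
numeric_grads = []
for i in range(len(params)):
    bumped_up = list(params); bumped_up[i] += eps
    bumped_dn = list(params); bumped_dn[i] -= eps
    # Central difference approximates dL/d(param i)
    numeric_grads.append((loss_fn(*bumped_up) - loss_fn(*bumped_dn)) / (2 * eps))
print([f"{g:.4f}" for g in numeric_grads])
```

The four numerical estimates should agree with the chain-rule gradients above to roughly four decimal places, which is exactly the kind of agreement practitioners look for when validating a hand-written backward pass.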
These calculated gradients are exactly what we need for the update step in gradient descent (or its variants like SGD or mini-batch GD). For example, using standard gradient descent, the update rules would look like this (where η is the learning rate):
w1 ← w1 − η ∂L/∂w1
b1 ← b1 − η ∂L/∂b1
w2 ← w2 − η ∂L/∂w2
b2 ← b2 − η ∂L/∂b2
Plugging in our calculated gradients (assuming η=0.1 for illustration):
w1 ← 0.5 − 0.1 × (0.0142) = 0.5 − 0.00142 = 0.49858
b1 ← 0.1 − 0.1 × (0.0071) = 0.1 − 0.00071 = 0.09929
w2 ← −0.3 − 0.1 × (−0.0950) = −0.3 + 0.00950 = −0.29050
b2 ← 0.2 − 0.1 × (−0.1266) = 0.2 + 0.01266 = 0.21266
After just one update step, the parameters have been slightly adjusted in directions that are expected to reduce the loss for this specific input-output pair. Repeating this process (forward pass, backward pass, parameter update) over many data points and iterations allows the network to learn complex patterns.
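To make "repeating this process" concrete, here is a minimal sketch of that loop for this same network, written in plain Python with the variable names used throughout this example (the iteration count and print frequency are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y_true, lr = 2.0, 1.0, 0.1
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2

for step in range(200):
    # Forward pass
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    loss = 0.5 * (a2 - y_true) ** 2
    # Backward pass (the chain rule, exactly as derived above)
    dL_dz2 = (a2 - y_true) * a2 * (1 - a2)
    dL_dw2 = dL_dz2 * a1
    dL_db2 = dL_dz2
    dL_dz1 = dL_dz2 * w2 * a1 * (1 - a1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1
    # Gradient descent update
    w1 -= lr * dL_dw1
    b1 -= lr * dL_db1
    w2 -= lr * dL_dw2
    b2 -= lr * dL_db2
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")
```

Running this, the printed loss shrinks steadily from its initial 0.1282 as a2 is pushed toward the target of 1.0, which is the learning behavior the single update step above only hints at.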
Let's verify these calculations with a short Python snippet using basic math functions.
import math
# Sigmoid function and its derivative
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(z):
    # Calculate using the output of sigmoid: sigma(z) * (1 - sigma(z))
    s_z = sigmoid(z)
    return s_z * (1 - s_z)
    # Alternatively: math.exp(-z) / ((1 + math.exp(-z))**2)
# Input and Target
x = 2.0
y_true = 1.0
# Initial Parameters
w1 = 0.5
b1 = 0.1
w2 = -0.3
b2 = 0.2
# --- Forward Pass ---
# Hidden Layer
z1 = w1 * x + b1
a1 = sigmoid(z1)
# Output Layer
z2 = w2 * a1 + b2
a2 = sigmoid(z2) # y_pred
# Loss
loss = 0.5 * (a2 - y_true)**2
print(f"--- Forward Pass ---")
print(f"z1: {z1:.4f}")
print(f"a1: {a1:.4f}")
print(f"z2: {z2:.4f}")
print(f"a2 (y_pred): {a2:.4f}")
print(f"Loss: {loss:.4f}\n")
# --- Backward Pass ---
# Gradients for Output Layer
dL_da2 = a2 - y_true
da2_dz2 = sigmoid_prime(z2) # Or use a2 * (1 - a2)
dL_dz2 = dL_da2 * da2_dz2
dL_dw2 = dL_dz2 * a1
dL_db2 = dL_dz2 * 1.0
# Gradients for Hidden Layer
dz2_da1 = w2
dL_da1 = dL_dz2 * dz2_da1
da1_dz1 = sigmoid_prime(z1) # Or use a1 * (1 - a1)
dL_dz1 = dL_da1 * da1_dz1
dL_dw1 = dL_dz1 * x
dL_db1 = dL_dz1 * 1.0
print(f"--- Backward Pass (Gradients) ---")
print(f"dL/da2: {dL_da2:.4f}")
print(f"dL/dz2: {dL_dz2:.4f}")
print(f"dL/dw2: {dL_dw2:.4f}")
print(f"dL/db2: {dL_db2:.4f}")
print(f"dL/da1: {dL_da1:.4f}")
print(f"dL/dz1: {dL_dz1:.4f}")
print(f"dL/dw1: {dL_dw1:.4f}")
print(f"dL/db1: {dL_db1:.4f}")
# --- Parameter Update Example (Learning Rate = 0.1) ---
lr = 0.1
w1_new = w1 - lr * dL_dw1
b1_new = b1 - lr * dL_db1
w2_new = w2 - lr * dL_dw2
b2_new = b2 - lr * dL_db2
print(f"\n--- Parameter Update (lr = {lr}) ---")
print(f"New w1: {w1_new:.5f}")
print(f"New b1: {b1_new:.5f}")
print(f"New w2: {w2_new:.5f}")
print(f"New b2: {b2_new:.5f}")
Running this code produces results that match our manual calculations, allowing for small differences in the last printed digit (the manual steps rounded intermediate values, while the code keeps full precision):
--- Forward Pass ---
z1: 1.1000
a1: 0.7503
z2: -0.0251
a2 (y_pred): 0.4937
Loss: 0.1282
--- Backward Pass (Gradients) ---
dL/da2: -0.5063
dL/dz2: -0.1265
dL/dw2: -0.0949
dL/db2: -0.1265
dL/da1: 0.0380
dL/dz1: 0.0071
dL/dw1: 0.0142
dL/db1: 0.0071
--- Parameter Update (lr = 0.1) ---
New w1: 0.49858
New b1: 0.09929
New w2: -0.29051
New b2: 0.21265
This example demonstrates the mechanical process of backpropagation. It's essentially a systematic application of the chain rule to compute how sensitive the final loss is to each parameter in the network, propagating these sensitivities backward layer by layer. While deep learning frameworks automate this, understanding the underlying calculus is fundamental to grasping how neural networks learn.
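As a glimpse of how frameworks automate this, the chain-rule bookkeeping above can be packaged into a tiny reverse-mode autodiff sketch. The Value class below is invented for illustration (in the spirit of minimal autograd implementations, not any particular library's API): each operation records its inputs and a local backward rule, and backward() replays those rules from the loss outward, reproducing the gradients we derived by hand.

```python
import math

class Value:
    """Minimal reverse-mode autodiff node (illustrative sketch only)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1 / (1 + math.exp(-self.data))
        out = Value(s, (self,))
        def _backward():  # local derivative sigma(z)(1 - sigma(z))
            self.grad += s * (1 - s) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule node by node
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0  # dL/dL = 1
        for v in reversed(topo):
            v._backward()

# Same network, same numbers as the worked example
w1, b1 = Value(0.5), Value(0.1)
w2, b2 = Value(-0.3), Value(0.2)
x, y_true = 2.0, 1.0
a1 = (w1 * x + b1).sigmoid()
a2 = (w2 * a1 + b2).sigmoid()
diff = a2 + (-y_true)
loss = diff * diff * 0.5
loss.backward()
print(round(w1.grad, 4), round(b1.grad, 4), round(w2.grad, 4), round(b2.grad, 4))
```

The printed gradients match the hand-derived values to within rounding, showing that backpropagation in a framework is the same chain-rule traversal, just recorded automatically.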
© 2025 ApX Machine Learning