Okay, let's work through the backpropagation algorithm with a concrete example. As discussed, training a neural network involves adjusting its weights and biases to minimize a loss function. Backpropagation provides an efficient way to compute the gradient of the loss function with respect to every parameter in the network, which is exactly what gradient descent needs to perform these adjustments.
We'll use a very simple network: one input neuron, a single hidden layer with two neurons, and one output neuron. This is small enough to track all calculations manually.
1. Network Setup and Forward Pass
Let's define our simple network:
- Input: $x = 0.5$
- Hidden Layer (2 neurons): Sigmoid activation function.
- Output Layer (1 neuron): Sigmoid activation function.
- Weights & Biases (small starting values chosen by hand for this example; in practice they would be initialized randomly):
  - Input to Hidden: $w_1 = 0.1$, $w_2 = 0.2$
  - Hidden Biases: $b_{h1} = 0.1$, $b_{h2} = 0.1$
  - Hidden to Output: $w_3 = 0.3$, $w_4 = 0.4$
  - Output Bias: $b_o = 0.1$
- Target Output: $y = 1.0$
- Loss Function: Mean Squared Error (MSE): $L = \frac{1}{2}(\hat{y} - y)^2$. (We use the factor $\frac{1}{2}$ for convenience, as its derivative cancels the exponent 2.)
- Activation Function: Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$. Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
Let's perform the forward pass calculation step-by-step:
a. Input to Hidden Layer:
- Weighted sum for hidden neuron 1: $z_{h1} = w_1 \cdot x + b_{h1} = (0.1 \cdot 0.5) + 0.1 = 0.05 + 0.1 = 0.15$
- Activation for hidden neuron 1: $h_1 = \sigma(z_{h1}) = \sigma(0.15) \approx 0.5374$
- Weighted sum for hidden neuron 2: $z_{h2} = w_2 \cdot x + b_{h2} = (0.2 \cdot 0.5) + 0.1 = 0.1 + 0.1 = 0.20$
- Activation for hidden neuron 2: $h_2 = \sigma(z_{h2}) = \sigma(0.20) \approx 0.5498$
b. Hidden Layer to Output Layer:
- Weighted sum for output neuron: $z_o = w_3 \cdot h_1 + w_4 \cdot h_2 + b_o = (0.3 \cdot 0.5374) + (0.4 \cdot 0.5498) + 0.1 \approx 0.1612 + 0.2199 + 0.1 = 0.4811$
- Final prediction (output activation): $\hat{y} = \sigma(z_o) = \sigma(0.4811) \approx 0.6180$
c. Calculate Loss:
- Loss: $L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.6180 - 1.0)^2 = \frac{1}{2}(-0.3820)^2 \approx \frac{1}{2}(0.1459) \approx 0.0730$
Our network currently predicts 0.6180, which is quite far from the target 1.0, resulting in a loss of 0.0730. Now, let's use backpropagation to find out how to adjust the weights and biases to reduce this loss.
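Before moving on, these forward-pass numbers are easy to verify with a few lines of plain Python. This is just a sanity check of the arithmetic above, not part of the algorithm itself:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the setup above
x, y = 0.5, 1.0
w1, w2, bh1, bh2 = 0.1, 0.2, 0.1, 0.1
w3, w4, bo = 0.3, 0.4, 0.1

h1 = sigmoid(w1 * x + bh1)               # sigma(0.15) ~ 0.5374
h2 = sigmoid(w2 * x + bh2)               # sigma(0.20) ~ 0.5498
y_hat = sigmoid(w3 * h1 + w4 * h2 + bo)  # sigma(0.4811) ~ 0.6180
loss = 0.5 * (y_hat - y) ** 2            # ~ 0.0730
print(f"h1={h1:.4f}, h2={h2:.4f}, y_hat={y_hat:.4f}, loss={loss:.4f}")
```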
2. Backward Pass: Calculating Gradients
The core idea is to use the chain rule from calculus to compute $\frac{\partial L}{\partial w}$ for every weight $w$ (and similarly $\frac{\partial L}{\partial b}$ for every bias $b$). We start from the end (the loss) and work backward.
a. Gradients for Output Layer Weights ($w_3$, $w_4$) and Bias ($b_o$)
We need $\frac{\partial L}{\partial w_3}$, $\frac{\partial L}{\partial w_4}$, and $\frac{\partial L}{\partial b_o}$. Let's break down $\frac{\partial L}{\partial w_3}$ using the chain rule:
$$\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial w_3}$$
Let's calculate each part:
- $\frac{\partial L}{\partial \hat{y}}$: Derivative of the loss w.r.t. the prediction.
  $L = \frac{1}{2}(\hat{y} - y)^2 \implies \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y) = 0.6180 - 1.0 = -0.3820$
- $\frac{\partial \hat{y}}{\partial z_o}$: Derivative of the output activation function w.r.t. its input ($z_o$). Since $\hat{y} = \sigma(z_o)$, the derivative is $\sigma'(z_o) = \sigma(z_o)(1 - \sigma(z_o))$.
  $\frac{\partial \hat{y}}{\partial z_o} = \hat{y}(1 - \hat{y}) = 0.6180 \cdot (1 - 0.6180) = 0.6180 \cdot 0.3820 \approx 0.2361$
- $\frac{\partial z_o}{\partial w_3}$: Derivative of the output weighted sum w.r.t. $w_3$.
  $z_o = w_3 h_1 + w_4 h_2 + b_o \implies \frac{\partial z_o}{\partial w_3} = h_1 \approx 0.5374$
Now, combine them:
$$\frac{\partial L}{\partial w_3} = (-0.3820) \cdot (0.2361) \cdot (0.5374) \approx -0.0485$$
Similarly for $w_4$: the only difference is the last term in the chain rule.
$$\frac{\partial z_o}{\partial w_4} = h_2 \approx 0.5498$$
$$\frac{\partial L}{\partial w_4} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial w_4} = (-0.3820) \cdot (0.2361) \cdot (0.5498) \approx -0.0496$$
And for the output bias $b_o$:
$$\frac{\partial z_o}{\partial b_o} = 1$$
$$\frac{\partial L}{\partial b_o} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial b_o} = (-0.3820) \cdot (0.2361) \cdot (1) \approx -0.0902$$
Often, the first two terms are combined into a single "delta" term for the output layer: $\delta_o = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \approx -0.0902$. The gradients are then simply $\frac{\partial L}{\partial w_3} = \delta_o \cdot h_1$, $\frac{\partial L}{\partial w_4} = \delta_o \cdot h_2$, and $\frac{\partial L}{\partial b_o} = \delta_o$.
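In code, the delta shortcut reduces the whole output layer to a few lines. A minimal sketch using the rounded values computed above:

```python
# Output-layer gradients via the delta term (rounded values from the forward pass)
y, y_hat = 1.0, 0.6180
h1, h2 = 0.5374, 0.5498

delta_o = (y_hat - y) * y_hat * (1 - y_hat)  # dL/dy_hat * dy_hat/dz_o ~ -0.0902
dL_dw3 = delta_o * h1   # ~ -0.0485
dL_dw4 = delta_o * h2   # ~ -0.0496
dL_dbo = delta_o        # ~ -0.0902
print(f"dL/dw3={dL_dw3:.4f}, dL/dw4={dL_dw4:.4f}, dL/dbo={dL_dbo:.4f}")
```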
b. Gradients for Hidden Layer Weights ($w_1$, $w_2$) and Biases ($b_{h1}$, $b_{h2}$)
Now we need to propagate the error backward to the hidden layer. We need $\frac{\partial L}{\partial w_1}$, $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial b_{h1}}$, and $\frac{\partial L}{\partial b_{h2}}$. Let's focus on $w_1$.
The chain rule path is longer:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}} \cdot \frac{\partial z_{h1}}{\partial w_1}$$
We can reuse $\delta_o = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \approx -0.0902$.
Let's calculate the new parts:
- $\frac{\partial z_o}{\partial h_1}$: Derivative of the output weighted sum w.r.t. the activation of the first hidden neuron.
  $z_o = w_3 h_1 + w_4 h_2 + b_o \implies \frac{\partial z_o}{\partial h_1} = w_3 = 0.3$
- $\frac{\partial h_1}{\partial z_{h1}}$: Derivative of the first hidden neuron's activation w.r.t. its input. $h_1 = \sigma(z_{h1})$, so the derivative is $\sigma'(z_{h1}) = h_1(1 - h_1)$.
  $\frac{\partial h_1}{\partial z_{h1}} = 0.5374 \cdot (1 - 0.5374) = 0.5374 \cdot 0.4626 \approx 0.2486$
- $\frac{\partial z_{h1}}{\partial w_1}$: Derivative of the first hidden neuron's weighted sum w.r.t. $w_1$.
  $z_{h1} = w_1 x + b_{h1} \implies \frac{\partial z_{h1}}{\partial w_1} = x = 0.5$
Now, combine everything for $\frac{\partial L}{\partial w_1}$:
$$\frac{\partial L}{\partial w_1} = \delta_o \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}} \cdot \frac{\partial z_{h1}}{\partial w_1} = (-0.0902) \cdot (0.3) \cdot (0.2486) \cdot (0.5) \approx -0.00336$$
Similarly for $w_2$: the error contribution flows through $h_2$.
$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}} \cdot \frac{\partial z_{h2}}{\partial w_2}$$
- $\frac{\partial z_o}{\partial h_2} = w_4 = 0.4$
- $\frac{\partial h_2}{\partial z_{h2}} = h_2(1 - h_2) = 0.5498 \cdot (1 - 0.5498) = 0.5498 \cdot 0.4502 \approx 0.2475$
- $\frac{\partial z_{h2}}{\partial w_2} = x = 0.5$
Combine them:
$$\frac{\partial L}{\partial w_2} = \delta_o \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}} \cdot \frac{\partial z_{h2}}{\partial w_2} = (-0.0902) \cdot (0.4) \cdot (0.2475) \cdot (0.5) \approx -0.00446$$
For the hidden biases $b_{h1}$ and $b_{h2}$, the last factor in the chain rule changes to $\frac{\partial z_{h1}}{\partial b_{h1}} = 1$ and $\frac{\partial z_{h2}}{\partial b_{h2}} = 1$:
$$\frac{\partial L}{\partial b_{h1}} = \delta_o \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}} \cdot 1 = (-0.0902) \cdot (0.3) \cdot (0.2486) \cdot (1) \approx -0.00673$$
$$\frac{\partial L}{\partial b_{h2}} = \delta_o \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}} \cdot 1 = (-0.0902) \cdot (0.4) \cdot (0.2475) \cdot (1) \approx -0.00893$$
We can define delta terms for the hidden layer as well: $\delta_{h1} = \delta_o \cdot \frac{\partial z_o}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}}$ and $\delta_{h2} = \delta_o \cdot \frac{\partial z_o}{\partial h_2} \cdot \frac{\partial h_2}{\partial z_{h2}}$. The gradients are then $\frac{\partial L}{\partial w_1} = \delta_{h1} \cdot x$, $\frac{\partial L}{\partial w_2} = \delta_{h2} \cdot x$, $\frac{\partial L}{\partial b_{h1}} = \delta_{h1}$, and $\frac{\partial L}{\partial b_{h2}} = \delta_{h2}$.
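A minimal sketch of the same hidden-layer calculation in code, reusing the rounded value of $\delta_o$ from above:

```python
# Hidden-layer gradients via per-neuron delta terms
x = 0.5
w3, w4 = 0.3, 0.4
h1, h2 = 0.5374, 0.5498
delta_o = -0.0902  # rounded output-layer delta from above

delta_h1 = delta_o * w3 * h1 * (1 - h1)   # ~ -0.00673
delta_h2 = delta_o * w4 * h2 * (1 - h2)   # ~ -0.00893
dL_dw1, dL_dbh1 = delta_h1 * x, delta_h1  # ~ -0.00336, -0.00673
dL_dw2, dL_dbh2 = delta_h2 * x, delta_h2  # ~ -0.00446, -0.00893
print(f"dL/dw1={dL_dw1:.5f}, dL/dw2={dL_dw2:.5f}")
print(f"dL/dbh1={dL_dbh1:.5f}, dL/dbh2={dL_dbh2:.5f}")
```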
3. Visualizing the Flow
We can represent these calculations using a computational graph. Nodes represent values (input, weights, intermediate results, loss), and edges represent operations. Backpropagation involves calculating the gradient at each node by summing the gradients flowing in from its outgoing edges.
*Figure: A computational graph showing the forward pass calculations for our simple network. Backpropagation computes gradients by traversing this graph backward from the loss ($L$). Solid arrows indicate the forward computation flow; gradients flow in the reverse direction along these paths during backpropagation.*
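PyTorch's autograd engine records exactly this kind of graph as you compute. A tiny standalone illustration: every tensor produced by a tracked operation carries a `grad_fn` attribute naming that operation, and `backward()` walks these links from the loss back to the parameters.

```python
import torch

# Build a miniature graph: one weight, one sigmoid, one squared-error loss
w = torch.tensor(0.3, requires_grad=True)
h = torch.tensor(0.5374)  # a fixed input "activation"
y_hat = torch.sigmoid(w * h)
loss = 0.5 * (y_hat - 1.0) ** 2

print(loss.grad_fn)                 # the op that produced the loss, e.g. <MulBackward0 ...>
print(loss.grad_fn.next_functions)  # its parent nodes, one step back toward w
```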
4. Weight Update
With the gradients calculated, we can update the weights and biases using the gradient descent rule. Let's assume a learning rate $\eta = 0.1$.
- $w_1^{\text{new}} = w_1 - \eta \frac{\partial L}{\partial w_1} = 0.1 - (0.1 \cdot -0.00336) \approx 0.1003$
- $w_2^{\text{new}} = w_2 - \eta \frac{\partial L}{\partial w_2} = 0.2 - (0.1 \cdot -0.00446) \approx 0.2004$
- $b_{h1}^{\text{new}} = b_{h1} - \eta \frac{\partial L}{\partial b_{h1}} = 0.1 - (0.1 \cdot -0.00673) \approx 0.1007$
- $b_{h2}^{\text{new}} = b_{h2} - \eta \frac{\partial L}{\partial b_{h2}} = 0.1 - (0.1 \cdot -0.00893) \approx 0.1009$
- $w_3^{\text{new}} = w_3 - \eta \frac{\partial L}{\partial w_3} = 0.3 - (0.1 \cdot -0.0485) \approx 0.3049$
- $w_4^{\text{new}} = w_4 - \eta \frac{\partial L}{\partial w_4} = 0.4 - (0.1 \cdot -0.0496) \approx 0.4050$
- $b_o^{\text{new}} = b_o - \eta \frac{\partial L}{\partial b_o} = 0.1 - (0.1 \cdot -0.0902) \approx 0.1090$
After just one step, the parameters have been slightly adjusted in directions that should reduce the loss on the next forward pass with the same input.
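We can confirm the direction is right with another quick plain-Python check: re-running the forward pass with the updated parameters should yield a slightly lower loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_loss(w1, w2, bh1, bh2, w3, w4, bo, x=0.5, y=1.0):
    h1 = sigmoid(w1 * x + bh1)
    h2 = sigmoid(w2 * x + bh2)
    y_hat = sigmoid(w3 * h1 + w4 * h2 + bo)
    return 0.5 * (y_hat - y) ** 2

before = forward_loss(0.1, 0.2, 0.1, 0.1, 0.3, 0.4, 0.1)
after = forward_loss(0.1003, 0.2004, 0.1007, 0.1009, 0.3049, 0.4050, 0.1090)
print(f"loss before: {before:.4f}, loss after: {after:.4f}")  # the loss should drop slightly
```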
5. Backpropagation in PyTorch
Manually calculating these gradients is instructive but quickly becomes impractical for larger networks. Deep learning frameworks like PyTorch automate this process using automatic differentiation (autograd). Here's how you might set up the same calculation:
```python
import torch

# --- Setup ---
# Single precision on the CPU is plenty for this tiny example
dtype = torch.float
device = torch.device("cpu")

# Define tensors for the input, target, and parameters.
# requires_grad=True tells PyTorch to track operations on a tensor
# so gradients can be computed for it during the backward pass.
x = torch.tensor([[0.5]], device=device, dtype=dtype)
y = torch.tensor([[1.0]], device=device, dtype=dtype)
w1 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True)
w2 = torch.tensor([[0.2]], device=device, dtype=dtype, requires_grad=True)
bh1 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True)
bh2 = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True)
w3 = torch.tensor([[0.3]], device=device, dtype=dtype, requires_grad=True)
w4 = torch.tensor([[0.4]], device=device, dtype=dtype, requires_grad=True)
bo = torch.tensor([[0.1]], device=device, dtype=dtype, requires_grad=True)

# --- Forward Pass ---
# A real network would store the hidden weights as a matrix and use a single
# matrix multiplication; we keep scalar tensors to mirror the manual calculation.
z_h1 = x @ w1.t() + bh1  # @ is matrix multiplication (just 1x1 here)
h1 = torch.sigmoid(z_h1)
z_h2 = x @ w2.t() + bh2
h2 = torch.sigmoid(z_h2)
z_o = h1 @ w3.t() + h2 @ w4.t() + bo
y_hat = torch.sigmoid(z_o)

# --- Loss Calculation ---
# torch.nn.MSELoss would be the idiomatic choice; we compute the loss
# manually to match the hand calculation above.
loss = 0.5 * (y_hat - y).pow(2)
loss = loss.sum()  # reduce the 1x1 tensor to a scalar (a real batch would typically average)

print(f"Input x: {x.item():.4f}")
print(f"Target y: {y.item():.4f}")
print(f"h1: {h1.item():.4f}, h2: {h2.item():.4f}")
print(f"Prediction y_hat: {y_hat.item():.4f}")
print(f"Loss: {loss.item():.4f}")

# --- Backward Pass (Autograd) ---
# Computes gradients for all tensors with requires_grad=True
loss.backward()

# --- Check Gradients ---
# Gradients are stored in the .grad attribute of each tensor
print("\n--- Gradients ---")
print(f"dL/dw1: {w1.grad.item():.5f} (Manual: -0.00336)")
print(f"dL/dw2: {w2.grad.item():.5f} (Manual: -0.00446)")
print(f"dL/dbh1: {bh1.grad.item():.5f} (Manual: -0.00673)")
print(f"dL/dbh2: {bh2.grad.item():.5f} (Manual: -0.00893)")
print(f"dL/dw3: {w3.grad.item():.5f} (Manual: -0.0485)")
print(f"dL/dw4: {w4.grad.item():.5f} (Manual: -0.0496)")
print(f"dL/dbo: {bo.grad.item():.5f} (Manual: -0.0902)")
# The manual calculations round intermediate values to four decimal places,
# so tiny discrepancies against PyTorch's full-precision results are expected.
```
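In practice you would not apply the weight updates by hand either. Assuming the parameter tensors from the listing above are still in scope, here is a minimal sketch of the same gradient-descent step using `torch.optim.SGD`:

```python
# Continuing from the listing above (w1, w2, bh1, bh2, w3, w4, bo still in scope)
params = [w1, w2, bh1, bh2, w3, w4, bo]
optimizer = torch.optim.SGD(params, lr=0.1)  # same learning rate as the manual update

optimizer.step()       # applies p = p - lr * p.grad to every parameter
optimizer.zero_grad()  # clear .grad before the next forward/backward pass
print(f"Updated w3: {w3.item():.4f}")  # the manual calculation gave ~ 0.3049
```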
This practical walk-through demonstrates how backpropagation meticulously applies the chain rule to find the contribution of each weight and bias to the overall loss. While frameworks handle the implementation, understanding this flow is fundamental to diagnosing training issues and designing effective network architectures. This calculated gradient information is then fed into optimization algorithms like SGD, Adam, or RMSprop (discussed next) to iteratively improve the model.