Training a neural network involves iteratively adjusting its weights and biases to minimize a loss function. As introduced earlier, this minimization is typically achieved using optimization algorithms like gradient descent, which rely on knowing the gradient of the loss function with respect to the network's parameters. Calculating these gradients efficiently across potentially many layers is the job of the backpropagation algorithm.
At the heart of each training iteration are two distinct phases: the forward pass and the backward pass. Understanding these two phases is fundamental to grasping how networks learn.
The Forward Pass: Generating Predictions and Calculating Loss
The forward pass is the process of feeding input data through the network to generate an output prediction. It's essentially the network performing its primary function: making an inference based on the input.
- Input Propagation: The process starts when a batch of input data X is presented to the input layer.
- Layer-by-Layer Computation: The data flows sequentially through the network's layers (input, hidden, output). At each layer l, the following typically happens:
- A linear combination is computed: z[l]=W[l]a[l−1]+b[l] where a[l−1] is the activation from the previous layer (or the input X for the first hidden layer), W[l] are the weights, and b[l] is the bias for the current layer l.
- An activation function g[l] is applied to introduce non-linearity: a[l]=g[l](z[l])
- Output Generation: The final layer produces the network's output, often denoted as ŷ or a[L] (where L is the last layer). This output represents the network's prediction for the given input X.
- Loss Calculation: The forward pass concludes by comparing the network's prediction ŷ with the true target values y using a predefined loss function L(ŷ, y). This loss value quantifies how well (or poorly) the network performed on this specific batch of data.
During the forward pass, intermediate values like the weighted sums z[l] and activations a[l] at each layer are often stored or cached. These values are needed for the subsequent backward pass.
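The steps above can be sketched in a few lines of NumPy. The layer sizes, the ReLU and identity activations, and the mean-squared-error loss here are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-layer network: 3 inputs -> 4 hidden units (ReLU) -> 1 output (linear)
W1, b1 = rng.standard_normal((4, 3)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.1, np.zeros((1, 1))

def forward(X, y):
    """Forward pass: compute the prediction and loss, caching z and a for backprop."""
    z1 = W1 @ X + b1               # z[1] = W[1] a[0] + b[1], with a[0] = X
    a1 = np.maximum(0, z1)         # a[1] = g[1](z[1]), here ReLU
    z2 = W2 @ a1 + b2              # z[2] = W[2] a[1] + b[2]
    a2 = z2                        # identity activation at the output layer
    loss = np.mean((a2 - y) ** 2)  # example loss L(ŷ, y): mean squared error
    cache = (X, z1, a1, z2, a2)    # intermediate values stored for the backward pass
    return loss, cache

X = rng.standard_normal((3, 5))    # batch of 5 inputs, 3 features each (columns)
y = rng.standard_normal((1, 5))    # matching targets
loss, cache = forward(X, y)
```

Note that the cache is returned alongside the loss: without z[l] and a[l−1] from this pass, the backward pass could not evaluate its gradient formulas.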
The Backward Pass: Computing and Propagating Gradients
Once the forward pass is complete and the loss L is calculated, the backward pass begins. Its objective is to determine how much each weight W[l] and bias b[l] in the network contributed to the calculated loss. This is achieved by computing the gradients of the loss function with respect to these parameters.
- Starting Point: The process starts at the very end of the network with the computed loss L. The derivative of the loss with respect to itself is simply 1; the first meaningful gradient is therefore the derivative of the loss with respect to the final output activation, ∂L/∂a[L].
- Applying the Chain Rule: The core of the backward pass is the repeated application of the calculus chain rule. It allows us to compute the gradients layer by layer, moving backward from the output layer towards the input layer. For a given layer l, we want to compute ∂L/∂W[l] and ∂L/∂b[l]. To do this, we first compute the gradient of the loss with respect to the layer's output activation, ∂L/∂a[l], then with respect to its weighted sum, ∂L/∂z[l]. The chain rule relates these gradients (⊙ denotes the element-wise product):
∂L/∂z[l] = ∂L/∂a[l] ⊙ ∂a[l]/∂z[l] = ∂L/∂a[l] ⊙ g′[l](z[l])
And using ∂L/∂z[l], we can find the gradients with respect to the parameters of layer l:
∂L/∂W[l] = ∂L/∂z[l] · ∂z[l]/∂W[l] = (∂L/∂z[l]) (a[l−1])ᵀ
∂L/∂b[l] = ∂L/∂z[l] · ∂z[l]/∂b[l] = ∂L/∂z[l]
Crucially, the gradient needed to continue the backward pass to the previous layer (l−1) is also derived:
∂L/∂a[l−1] = ∂L/∂z[l] · ∂z[l]/∂a[l−1] = (W[l])ᵀ (∂L/∂z[l])
- Gradient Accumulation: This process propagates the error signal (the gradient) backward through the network. Each layer uses the incoming gradient ∂L/∂a[l] (computed from the layer ahead of it) and the intermediate values (z[l], a[l−1]) cached during the forward pass to calculate the gradients for its own parameters (W[l], b[l]) and the gradient ∂L/∂a[l−1] to pass further backward.
- Gradient Usage: The final result of the backward pass is the gradient of the loss function with respect to all trainable parameters in the network (∇θL where θ represents all W and b). These gradients are exactly what optimization algorithms like Gradient Descent, Adam, or RMSprop need to update the network's parameters in a way that reduces the loss. The update rule generally follows the form:
θ_new = θ_old − η ∇θL
where η is the learning rate.
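The gradient formulas above translate almost line for line into code. The sketch below assumes the same kind of two-layer network as before (ReLU hidden layer, linear output, mean-squared-error loss), with inputs stored as columns; these choices are illustrative, not prescribed by the text:

```python
import numpy as np

def backward(W1, W2, cache, y):
    """Backward pass: propagate gradients from the loss back to every parameter."""
    X, z1, a1, z2, a2 = cache
    m = y.shape[1]                        # batch size

    # Starting point: ∂L/∂a[2] for the MSE loss L = mean((a2 - y)^2)
    da2 = 2 * (a2 - y) / (m * a2.shape[0])
    dz2 = da2 * 1.0                       # identity output: g'[2](z[2]) = 1

    # Parameter gradients for layer 2: ∂L/∂W[2] = dz2 · a1ᵀ, ∂L/∂b[2] = Σ dz2
    dW2 = dz2 @ a1.T
    db2 = dz2.sum(axis=1, keepdims=True)

    # Propagate to layer 1: ∂L/∂a[1] = W[2]ᵀ · dz2, then through the ReLU
    da1 = W2.T @ dz2
    dz1 = da1 * (z1 > 0)                  # g'[1](z) = 1 where z > 0, else 0

    dW1 = dz1 @ X.T                       # a[0] = X
    db1 = dz1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

def sgd_step(params, grads, lr=0.01):
    """Plain gradient-descent update: θ_new = θ_old − η ∇θL."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Each gradient has the same shape as the parameter it corresponds to, which is exactly what the update rule requires.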
Diagram showing the flow of information during the forward pass (solid blue/red arrows) and backward pass (dashed orange arrows) in a simple neural network. The forward pass computes predictions and loss, while the backward pass computes gradients used for parameter updates.
In summary, the forward and backward passes are the two essential stages of a single training iteration for a neural network:
- Forward Pass: Computes the prediction and the resulting loss.
- Backward Pass: Computes the gradients of the loss with respect to the network's parameters.
These two steps, executed repeatedly over batches of training data, allow the network to learn by progressively adjusting its weights and biases based on the calculated gradients, guided by an optimization algorithm. The detailed mechanics of how the backward pass efficiently calculates these gradients using computational graphs and the chain rule will be explored next.
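Putting the two passes together, repeated iteration over data looks roughly like the toy sketch below: a single linear unit fitting y = 2x. The data, learning rate, and number of steps are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1, 64))
y = 2.0 * X                       # toy target: the unit should learn w ≈ 2, b ≈ 0

w, b = 0.0, 0.0                   # single linear unit: ŷ = w·x + b
lr = 0.1                          # learning rate η

for step in range(200):
    # Forward pass: prediction and loss
    y_hat = w * X + b
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: gradients of the MSE loss with respect to w and b
    dy_hat = 2 * (y_hat - y) / y.size
    dw = np.sum(dy_hat * X)
    db = np.sum(dy_hat)

    # Parameter update: θ_new = θ_old − η ∇θL
    w -= lr * dw
    b -= lr * db
```

After a couple hundred iterations of this forward/backward/update cycle, w and b settle near the values that minimize the loss.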