Now that we have defined the network's architecture and initialized its parameters, we arrive at the core engine of the learning process: the training step. This step is executed repeatedly within the larger training loop for each batch of data. Its purpose is to adjust the network's weights and biases slightly based on the errors made on the current batch, gradually guiding the network toward better predictions.
Implementing a single training step involves a precise sequence of operations that brings together the concepts from previous chapters: forward propagation, loss calculation, backpropagation, and parameter updates via gradient descent. Let's break down this sequence.
Imagine you have a mini-batch of training examples ($X_{\text{batch}}$) and their corresponding true labels ($Y_{\text{batch}}$). A single training step processes this batch as follows:

1. Forward Propagation: The input data ($X_{\text{batch}}$) is fed into the network. It passes through each layer, undergoing a linear transformation (weighted sum plus bias) followed by an activation function. This produces the network's predictions ($\hat{Y}_{\text{batch}}$) for the batch, following the process detailed in Chapter 3.
2. Loss Calculation: The predictions ($\hat{Y}_{\text{batch}}$) are compared against the actual target values ($Y_{\text{batch}}$) using a chosen loss function (e.g., Mean Squared Error for regression, Cross-Entropy Loss for classification). This function quantifies how "wrong" the network's predictions were for this specific batch.
3. Backward Propagation (Backpropagation): This is where the network learns from its mistakes. Starting from the calculated loss $L$, we compute the gradient (derivative) of the loss with respect to every weight and bias in the network. Backpropagation uses the chain rule of calculus to compute these gradients efficiently, starting at the output layer and moving backward through the hidden layers toward the input layer.
4. Parameter Update (Gradient Descent): Armed with the gradients, we adjust the network's weights and biases. The gradients point in the direction of steepest ascent of the loss function, so to minimize the loss we move each parameter in the opposite direction of its gradient. The size of this step is controlled by the learning rate ($\alpha$); the update rule is restated below.
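For reference, using the same layer notation as the forward pass, the plain gradient descent update applied in step 4 for a layer $l$ is:

$$
W^{[l]} \leftarrow W^{[l]} - \alpha \frac{\partial L}{\partial W^{[l]}}, \qquad
b^{[l]} \leftarrow b^{[l]} - \alpha \frac{\partial L}{\partial b^{[l]}}
$$

where $\alpha$ is the learning rate and the partial derivatives are the gradients produced by backpropagation.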
This four-stage cycle forms the fundamental unit of work during training.
Diagram: the cycle of a single training step, which processes the input, calculates the loss, computes gradients, and updates the network parameters.
While deep learning frameworks handle the automatic differentiation for backpropagation and provide optimized gradient descent algorithms, understanding the logical flow is important. Conceptually, you can think of implementing a function like perform_training_step:
# Conceptual Python-like pseudocode

def perform_training_step(X_batch, Y_batch, network_parameters, learning_rate):
    # network_parameters contains the current W's and b's for all layers

    # 1. Forward Propagation
    predictions, forward_cache = forward_propagate(X_batch, network_parameters)

    # 2. Loss Calculation
    loss = calculate_loss(predictions, Y_batch)

    # 3. Backward Propagation
    # gradients contains dL/dW and dL/db for all layers
    gradients = backward_propagate(loss, forward_cache, network_parameters)

    # 4. Parameter Update
    updated_parameters = update_parameters(network_parameters, gradients, learning_rate)

    return updated_parameters, loss


# --- Helper function definitions (conceptual) ---
# def forward_propagate(X, params):            ... returns predictions, cache
# def calculate_loss(Y_hat, Y):                ... returns scalar loss
# def backward_propagate(loss, cache, params): ... returns gradients
# def update_parameters(params, grads, alpha): ... returns updated_params
In this pseudocode:

- forward_cache would store intermediate values (like the activations $A^{[l]}$ and pre-activations $Z^{[l]}$) needed for backpropagation.
- gradients would be a collection of gradient matrices and vectors, one for each weight matrix and bias vector.
- update_parameters applies the simple gradient descent rule shown earlier, although in practice more advanced optimizers (like Adam, discussed in Chapter 4) are often used.

Each time this perform_training_step function is called within the training loop (typically with a new batch of data), the network's parameters are nudged slightly closer to values that minimize the loss function. Repeating this process many times, over many batches and epochs, allows the network to learn complex patterns from the training data.
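To make the flow concrete, here is a minimal, runnable sketch of these four helpers for a single dense layer with a sigmoid output and binary cross-entropy loss. The helper names follow the pseudocode above, but the architecture, initialization, loss, and data are illustrative assumptions rather than the only way to fill in the details; note also that this backward_propagate takes the targets instead of the scalar loss, since the gradients for sigmoid plus cross-entropy depend on the targets directly.

import numpy as np

# Minimal sketch: one dense layer, sigmoid output, binary cross-entropy loss.

def forward_propagate(X, params):
    # Linear transformation followed by a sigmoid activation
    Z = X @ params["W"] + params["b"]          # pre-activation
    A = 1.0 / (1.0 + np.exp(-Z))               # activation = prediction
    cache = {"X": X, "Z": Z, "A": A}           # values reused by backprop
    return A, cache

def calculate_loss(Y_hat, Y):
    # Binary cross-entropy, averaged over the batch
    eps = 1e-12                                # avoid log(0)
    return -np.mean(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))

def backward_propagate(Y, cache, params):
    # Chain rule worked out by hand for sigmoid + cross-entropy: dL/dZ = A - Y
    X, A = cache["X"], cache["A"]
    m = X.shape[0]                             # batch size
    dZ = (A - Y) / m
    return {"dW": X.T @ dZ, "db": dZ.sum(axis=0)}

def update_parameters(params, grads, alpha):
    # Plain gradient descent step
    return {"W": params["W"] - alpha * grads["dW"],
            "b": params["b"] - alpha * grads["db"]}

def perform_training_step(X_batch, Y_batch, params, learning_rate):
    predictions, cache = forward_propagate(X_batch, params)
    loss = calculate_loss(predictions, Y_batch)
    grads = backward_propagate(Y_batch, cache, params)
    params = update_parameters(params, grads, learning_rate)
    return params, loss

# Tiny usage example: learn a noisy linear decision boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
Y = (X[:, :1] + X[:, 1:] > 0).astype(float)    # targets of shape (256, 1)
params = {"W": rng.normal(scale=0.1, size=(2, 1)), "b": np.zeros(1)}

for epoch in range(100):
    params, loss = perform_training_step(X, Y, params, learning_rate=0.5)

print(f"final loss: {loss:.4f}")

A real project would typically split the data into mini-batches, support multiple layers, and rely on a framework's optimizer, but the four-stage structure of perform_training_step stays the same.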