Now that we have defined the network's architecture and initialized its parameters, we arrive at the core engine of the learning process: the training step. This step is executed repeatedly within the larger training loop for each batch of data. Its purpose is to adjust the network's weights and biases slightly based on the errors made on the current batch, gradually guiding the network toward better predictions.
Implementing a single training step involves a precise sequence of operations that brings together the concepts from previous chapters: forward propagation, loss calculation, backpropagation, and parameter updates via gradient descent. Let's break down this sequence.
Imagine you have a mini-batch of training examples ($X_{\text{batch}}$) and their corresponding true labels ($Y_{\text{batch}}$). A single training step processes this batch as follows:

1. Forward Propagation: The input data ($X_{\text{batch}}$) is fed into the network. It passes through each layer, undergoing a linear transformation (weighted sum plus bias) followed by an activation function. This produces the network's predictions ($\hat{Y}_{\text{batch}}$) for the batch, following the process detailed in Chapter 3.
2. Loss Calculation: The predictions ($\hat{Y}_{\text{batch}}$) are compared against the actual target values ($Y_{\text{batch}}$) using a chosen loss function (e.g., Mean Squared Error for regression, Cross-Entropy Loss for classification). This function quantifies how "wrong" the network's predictions were for this specific batch.
3. Backward Propagation (Backpropagation): This is where the network learns from its mistakes. Starting from the calculated loss $L$, we compute the gradient (derivative) of the loss with respect to every weight and bias in the network. Backpropagation uses the chain rule of calculus to compute these gradients efficiently, starting at the output layer and moving backward through the hidden layers toward the input layer.
4. Parameter Update (Gradient Descent): Armed with the gradients, we adjust the network's weights and biases. The gradients point in the direction of steepest ascent of the loss function, so to minimize the loss we move each parameter in the opposite direction of its gradient. The size of this step is controlled by the learning rate ($\alpha$); the update rule is restated below.
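For reference, using the same layer notation as the forward pass, the plain gradient descent update applied in step 4 for a layer $l$ is:

$$
W^{[l]} \leftarrow W^{[l]} - \alpha \frac{\partial L}{\partial W^{[l]}}, \qquad
b^{[l]} \leftarrow b^{[l]} - \alpha \frac{\partial L}{\partial b^{[l]}}
$$

where $\alpha$ is the learning rate and the partial derivatives are the gradients produced by backpropagation.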
This four-stage cycle forms the fundamental unit of work during training.
Diagram: the cycle of a single training step, which processes the input, calculates the loss, computes gradients, and updates the network parameters.
While deep learning frameworks handle the automatic differentiation for backpropagation and provide optimized gradient descent algorithms, understanding the logical flow is important. Conceptually, you can think of implementing a function like perform_training_step:
# Conceptual Python-like pseudocode

def perform_training_step(X_batch, Y_batch, network_parameters, learning_rate):
    # network_parameters contains the current W's and b's for all layers

    # 1. Forward Propagation
    predictions, forward_cache = forward_propagate(X_batch, network_parameters)

    # 2. Loss Calculation
    loss = calculate_loss(predictions, Y_batch)

    # 3. Backward Propagation
    # gradients contains dL/dW and dL/db for all layers
    gradients = backward_propagate(loss, forward_cache, network_parameters)

    # 4. Parameter Update
    updated_parameters = update_parameters(network_parameters, gradients, learning_rate)

    return updated_parameters, loss


# --- Helper function definitions (conceptual) ---
# def forward_propagate(X, params):            ... returns predictions, cache
# def calculate_loss(Y_hat, Y):                ... returns scalar loss
# def backward_propagate(loss, cache, params): ... returns gradients
# def update_parameters(params, grads, alpha): ... returns updated_params
In this pseudocode:

- forward_cache would store intermediate values (like the activations $A^{[l]}$ and pre-activations $Z^{[l]}$) needed for backpropagation.
- gradients would be a collection of gradient matrices and vectors, one for each weight matrix and bias vector.
- update_parameters applies the simple gradient descent rule shown earlier, although in practice more advanced optimizers (like Adam, discussed in Chapter 4) are often used.

Each time this perform_training_step function is called within the training loop (typically with a new batch of data), the network's parameters are nudged slightly closer to values that minimize the loss function. Repeating this process many times, over many batches and epochs, allows the network to learn complex patterns from the training data.
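To make the flow concrete, here is a minimal, runnable sketch of these four helpers for a single dense layer with a sigmoid output and binary cross-entropy loss. The helper names follow the pseudocode above, but the architecture, initialization, loss, and data are illustrative assumptions rather than the only way to fill in the details; note also that this backward_propagate takes the targets instead of the scalar loss, since the gradients for sigmoid plus cross-entropy depend on the targets directly.

import numpy as np

# Minimal sketch: one dense layer, sigmoid output, binary cross-entropy loss.

def forward_propagate(X, params):
    # Linear transformation followed by a sigmoid activation
    Z = X @ params["W"] + params["b"]          # pre-activation
    A = 1.0 / (1.0 + np.exp(-Z))               # activation = prediction
    cache = {"X": X, "Z": Z, "A": A}           # values reused by backprop
    return A, cache

def calculate_loss(Y_hat, Y):
    # Binary cross-entropy, averaged over the batch
    eps = 1e-12                                # avoid log(0)
    return -np.mean(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))

def backward_propagate(Y, cache, params):
    # Chain rule worked out by hand for sigmoid + cross-entropy: dL/dZ = A - Y
    X, A = cache["X"], cache["A"]
    m = X.shape[0]                             # batch size
    dZ = (A - Y) / m
    return {"dW": X.T @ dZ, "db": dZ.sum(axis=0)}

def update_parameters(params, grads, alpha):
    # Plain gradient descent step
    return {"W": params["W"] - alpha * grads["dW"],
            "b": params["b"] - alpha * grads["db"]}

def perform_training_step(X_batch, Y_batch, params, learning_rate):
    predictions, cache = forward_propagate(X_batch, params)
    loss = calculate_loss(predictions, Y_batch)
    grads = backward_propagate(Y_batch, cache, params)
    params = update_parameters(params, grads, learning_rate)
    return params, loss

# Tiny usage example: learn a noisy linear decision boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
Y = (X[:, :1] + X[:, 1:] > 0).astype(float)    # targets of shape (256, 1)
params = {"W": rng.normal(scale=0.1, size=(2, 1)), "b": np.zeros(1)}

for epoch in range(100):
    params, loss = perform_training_step(X, Y, params, learning_rate=0.5)

print(f"final loss: {loss:.4f}")

A real project would typically split the data into mini-batches, support multiple layers, and rely on a framework's optimizer, but the four-stage structure of perform_training_step stays the same.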