Okay, your deep learning model is now defined with its layers and activation functions. You've also compiled it, specifying the loss function to measure error and the optimizer (such as Adam or SGD) that will guide learning by updating the model's weights. The next step is to actually train the model: feed it the prepared training data and let the optimizer adjust the weights iteratively to minimize the chosen loss function.
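As a quick recap in code, here is a minimal sketch of that setup on the PyTorch side, which is what the training loop later in this section assumes. The model architecture shown is only a placeholder; the criterion and optimizer match the ones referenced below.
import torch.nn as nn
import torch.optim as optim
# Placeholder model for illustration; use the model you defined in the previous sections
model = nn.Sequential(
    nn.Linear(20, 64),  # 20 input features (illustrative)
    nn.ReLU(),
    nn.Linear(64, 3),   # 3 output classes (illustrative)
)
criterion = nn.CrossEntropyLoss()                     # loss function measuring prediction error
optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer that will update the weights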
Training a neural network is fundamentally an iterative process. We don't just show the model the data once. Instead, we repeatedly expose it to the data, allowing it to gradually learn the underlying patterns. Each iteration involves several steps: the model makes predictions on a batch of data (the forward pass), the loss function measures how far those predictions are from the targets, backpropagation computes the gradients of the loss with respect to the weights (the backward pass), and the optimizer uses those gradients to update the weights.
This cycle repeats for many batches of data.
Deep learning frameworks provide convenient ways to manage this training loop. In Keras, this is often done using a single method called fit. In PyTorch, you typically write the loop explicitly, which offers more fine-grained control. Regardless of the specific implementation, the core concepts remain the same, and you'll need to specify several important parameters:
The first thing to provide is the training data: the input data (X_train) and corresponding target labels (y_train) that the model will learn from. We assume this data has already been preprocessed (e.g., scaled, reshaped) as discussed previously.
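If your preprocessed features and labels are NumPy arrays, one common way to get them into the tensor form used later in this section is sketched below. The array names, shapes, and dtypes here are assumptions about your preprocessing output, not values from earlier sections.
import numpy as np
import torch
# Placeholder arrays standing in for your preprocessed training data
X_train = np.random.rand(10000, 20).astype(np.float32)  # (samples, features)
y_train = np.random.randint(0, 3, size=10000)           # integer class labels
X_train_tensor = torch.from_numpy(X_train)          # float32 feature tensor
y_train_tensor = torch.from_numpy(y_train).long()   # int64 labels, as expected by nn.CrossEntropyLoss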
An epoch represents one complete pass through the entire training dataset. If your dataset has 10,000 samples and you train for 10 epochs, the model will see each sample 10 times during training (though likely in different batches and orders).
Choosing the number of epochs is important. Too few epochs and training stops before the model has captured the underlying patterns (underfitting); too many and the model can start to memorize the training data and perform worse on new data (overfitting).
Instead of processing the entire dataset at once for each weight update (which can be computationally infeasible for large datasets), we typically divide the training data into smaller chunks called mini-batches. The batch size defines how many training samples are processed in each forward/backward pass before the model's weights are updated.
The choice of batch size affects several aspects of training. Larger batches use more memory per update but make better use of parallel hardware such as GPUs, and they produce smoother, more stable gradient estimates. Smaller batches use less memory and update the weights more frequently, but the gradient estimates are noisier, which can make training less stable (though this noise sometimes helps generalization).
An epoch involves processing the entire dataset, typically split into mini-batches. For each batch, the model performs a forward pass, calculates loss, performs a backward pass (backpropagation), and updates its weights via the optimizer. This cycle repeats for all batches within the epoch, and then for multiple epochs.
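To make the bookkeeping concrete, the following short sketch works out how many weight updates occur, using the 10,000-sample dataset mentioned earlier with an illustrative batch size of 64 and 10 epochs.
import math
num_samples = 10_000   # dataset size from the example above
batch_size = 64        # illustrative batch size
num_epochs = 10        # illustrative number of epochs
batches_per_epoch = math.ceil(num_samples / batch_size)  # last batch may be smaller: 157
total_updates = batches_per_epoch * num_epochs           # 1,570 weight updates in total
print(f"{batches_per_epoch} batches per epoch, {total_updates} weight updates overall")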
Let's see how this looks in PyTorch. Assume you have your model, criterion (loss function, e.g., nn.CrossEntropyLoss), optimizer (e.g., optim.Adam), and a train_loader (a DataLoader that provides batches of data).
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Assume these are defined elsewhere:
# model: Your neural network model (subclass of nn.Module)
# criterion: Your loss function (e.g., nn.CrossEntropyLoss())
# optimizer: Your optimizer (e.g., optim.Adam(model.parameters(), lr=0.001))
# X_train_tensor, y_train_tensor: Your training data as PyTorch tensors
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # To run on GPU if possible
# --- Hyperparameters ---
BATCH_SIZE = 64
NUM_EPOCHS = 10
# --- Prepare DataLoader ---
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
# shuffle=True is important for training to ensure batches are different each epoch
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
# Move model to the correct device (CPU or GPU)
# model.to(device)
# --- Training Loop ---
model.train() # Set the model to training mode (important for layers like Dropout, BatchNorm)
print("Starting Training...")
for epoch in range(NUM_EPOCHS):
    running_loss = 0.0
    num_batches = len(train_loader)
    for i, batch in enumerate(train_loader):
        # 1. Get data from the batch and move to device
        # inputs, labels = batch[0].to(device), batch[1].to(device)
        inputs, labels = batch  # Assuming data is already on the correct device or CPU for simplicity

        # 2. Zero the parameter gradients (essential before backward pass)
        optimizer.zero_grad()

        # 3. Forward pass: Compute predictions
        outputs = model(inputs)

        # 4. Calculate loss
        loss = criterion(outputs, labels)

        # 5. Backward pass: Compute gradients
        loss.backward()

        # 6. Optimize: Update weights based on gradients
        optimizer.step()

        # Accumulate loss for reporting
        running_loss += loss.item()  # .item() gets the scalar value from the loss tensor

        # Optional: Print progress periodically
        if (i + 1) % 100 == 0 or (i + 1) == num_batches:  # Print every 100 mini-batches or at the end of epoch
            print(f'Epoch [{epoch + 1}/{NUM_EPOCHS}], Batch [{i + 1}/{num_batches}], Loss: {loss.item():.4f}')
            # Note: For a running average loss: {running_loss / (i + 1):.4f}

    epoch_loss = running_loss / num_batches
    print(f'Epoch [{epoch + 1}/{NUM_EPOCHS}] completed. Average Loss: {epoch_loss:.4f}')

print('Finished Training')

# --- Keras Equivalent (Conceptual) ---
# For comparison, the Keras equivalent encapsulates this loop:
# history = model.fit(X_train_tensor.numpy(), y_train_tensor.numpy(),
#                     epochs=NUM_EPOCHS,
#                     batch_size=BATCH_SIZE,
#                     shuffle=True,
#                     verbose=2)  # verbose controls how much info is printed
# print('Finished Training')
This PyTorch loop explicitly performs the steps outlined earlier: zeroing gradients, forward pass, loss calculation, backward pass, and optimizer step for each batch. The outer loop iterates through the specified number of epochs. Calling model.train() is important because some layers behave differently during training and evaluation. Keras abstracts this loop into the model.fit() call, managing the batch iteration, shuffling, and weight updates internally based on the epochs and batch_size arguments you provide.
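As a quick illustration of why the mode matters, here is a small sketch of the counterpart call you would typically make before evaluating: model.eval() switches layers such as Dropout and BatchNorm to their inference behavior, and torch.no_grad() disables gradient tracking. The toy model and tensor shapes are placeholders, not part of the model built earlier.
import torch
import torch.nn as nn
# A toy model containing a Dropout layer, purely for illustration
toy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)  # placeholder batch of 4 samples with 8 features
toy_model.train()                 # Dropout randomly zeroes activations during training
train_mode_output = toy_model(x)
toy_model.eval()                  # Dropout becomes a no-op during evaluation
with torch.no_grad():             # no gradients are tracked, saving memory and compute
    eval_mode_output = toy_model(x)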
Executing this training loop (either explicitly or via a method like fit) is where the learning happens. The model's parameters are adjusted based on the training data, aiming to minimize the loss function. However, simply running the loop isn't enough. We need to observe how training is progressing to make informed decisions. The next section will cover how to monitor metrics like loss and accuracy during training to understand model behavior and diagnose potential problems.