When training a neural network, especially with large datasets, processing the entire dataset at once to compute the loss and update the weights can be computationally expensive and memory-intensive. Furthermore, using the entire dataset for each weight update (as in traditional batch gradient descent) can slow convergence or leave the optimizer stuck in poor local minima. To address this, the training process is typically broken down into smaller, manageable steps using the concepts of batches and epochs.
An epoch represents one complete pass through the entire training dataset. If your dataset contains 10,000 images, one epoch concludes after the model has seen and learned from all 10,000 images exactly once.
Training a deep learning model usually requires multiple epochs. Why? Because a single pass is rarely enough for the model's weights to converge to optimal values. The network needs to see the data multiple times to learn the underlying patterns effectively. Think of it like studying for an exam: you wouldn't just read the textbook once; you'd review the material multiple times (multiple epochs) to reinforce your understanding.
The number of epochs is a hyperparameter you set before training begins. Choosing the right number is important: too few epochs and the model underfits, having not yet learned the underlying patterns; too many and it can overfit, memorizing the training data instead of generalizing to new examples.
Instead of processing the entire dataset in one go during an epoch, we divide the dataset into smaller subsets called batches. The batch size determines how many training examples are included in each batch.
During each epoch, the training data is (usually) shuffled and then divided into these batches. The model processes one batch at a time: it runs a forward pass on the batch, computes the loss, backpropagates the gradients, and updates the weights. This process repeats for all batches within the epoch. Each time the model processes a batch and updates its weights, it's called one iteration or step.
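To make this concrete, here is a minimal sketch of how one epoch could be organized in plain NumPy. The train_step function here is a hypothetical placeholder for the forward pass, loss computation, backpropagation, and weight update that a framework like Keras performs for you.

import numpy as np

def run_one_epoch(x_train, y_train, batch_size, train_step):
    num_samples = len(x_train)
    # Shuffle the dataset at the start of the epoch
    indices = np.random.permutation(num_samples)
    x_shuffled, y_shuffled = x_train[indices], y_train[indices]

    # Process the data one batch (one iteration) at a time
    for start in range(0, num_samples, batch_size):
        x_batch = x_shuffled[start:start + batch_size]
        y_batch = y_shuffled[start:start + batch_size]
        train_step(x_batch, y_batch)  # one weight update per batch (placeholder)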
For example, if you have a dataset with 2,000 samples and you set the batch size to 100, then one epoch will consist of:

$$\text{Iterations per Epoch} = \frac{\text{Total Training Samples}}{\text{Batch Size}} = \frac{2000}{100} = 20 \text{ iterations}$$
The model's weights will be updated 20 times during one epoch.
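As a quick sanity check, the same calculation in Python, using ceil so that a smaller final batch still counts as one iteration when the dataset size is not an exact multiple of the batch size:

import math

total_samples = 2000
batch_size = 100

# ceil handles the case where the last batch is smaller than batch_size
iterations_per_epoch = math.ceil(total_samples / batch_size)
print(iterations_per_epoch)  # 20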
An illustration showing how a full training dataset is divided into batches within a single epoch. Each batch is processed sequentially, leading to a weight update (iteration).
Using batches (often called mini-batch gradient descent) offers several advantages over processing the entire dataset at once (batch gradient descent) or one sample at a time (stochastic gradient descent, or SGD, though in practice "SGD" often refers to mini-batch SGD): each batch fits comfortably in memory, the weights are updated many times per epoch rather than once, the noise in the batch gradient estimates can help the optimizer escape poor local minima, and batches map efficiently onto the parallel hardware (GPUs and TPUs) typically used for training.
The batch size is another significant hyperparameter. Common batch sizes are powers of 2 (e.g., 32, 64, 128, 256) because they tend to align well with hardware memory, but other values can work too. The choice involves trade-offs: larger batches give more stable (less noisy) gradient estimates and better hardware utilization, but they consume more memory and can sometimes generalize slightly worse; smaller batches use less memory and add helpful gradient noise, but each epoch requires more iterations and training can be less stable.
In practice, a batch size like 32 or 64 is often a good starting point. You might experiment with different sizes as part of hyperparameter tuning.
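As a rough sketch of how such an experiment might look, the loop below trains a fresh model for each candidate batch size and compares the final validation loss. Here build_model is a hypothetical helper assumed to return a newly compiled Keras model, and x_train, y_train, x_val, y_val are assumed to already exist as NumPy arrays.

for batch_size in [32, 64, 128]:
    model = build_model()  # hypothetical helper returning a compiled model
    history = model.fit(
        x_train, y_train,
        batch_size=batch_size,
        epochs=10,
        validation_data=(x_val, y_val),
        verbose=0,
    )
    val_loss = history.history['val_loss'][-1]
    print(f"batch_size={batch_size}: final val_loss={val_loss:.4f}")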
When you call the fit() method in Keras, you specify these parameters:
# Assume 'model' is compiled and 'x_train', 'y_train', 'x_val', 'y_val' are NumPy arrays
history = model.fit(
    x_train,
    y_train,
    batch_size=32,                   # number of samples per gradient update
    epochs=10,                       # number of passes over the entire dataset
    validation_data=(x_val, y_val),  # evaluated at the end of each epoch
)
Here, the model will train for 10 epochs. In each epoch, it processes the x_train data in batches of 32 samples, updating the weights after each batch. The number of iterations per epoch is len(x_train) / 32, rounded up when the dataset size is not an exact multiple of the batch size (the final batch is simply smaller).
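The History object returned by fit() records the loss and metrics after every epoch, which is a convenient way to judge whether the chosen number of epochs was too low or too high. For example:

# Per-epoch metrics recorded during training
print(history.history['loss'])      # training loss after each epoch
print(history.history['val_loss'])  # validation loss after each epoch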
Understanding batches and epochs is fundamental to controlling the training process. They dictate how the model learns from the data over time, influencing training stability, speed, memory usage, and ultimately, the model's generalization performance.