After preparing your data by scaling numerical features and encoding categorical ones, the next consideration is how to feed this data into the neural network during training. Training a neural network involves iteratively adjusting its weights and biases to minimize a loss function. This adjustment process typically uses gradient descent, which calculates how the loss changes with respect to each parameter.
A naive approach might be to calculate these gradients based on the entire training dataset in one go. This method, known as Batch Gradient Descent, computes the precise gradient of the loss function across all training examples before making a single update to the network's parameters. While mathematically sound, this approach presents significant practical challenges, especially with the large datasets common in machine learning:
- Computational Expense: Performing forward and backward propagation through millions or even billions of data points just to compute a single parameter update is extremely time-consuming. Training would take an impractically long time.
- Memory Limitations: Loading the entire dataset, along with the intermediate activations and gradients needed for computation, might exceed the available RAM or, more critically, the memory (VRAM) of the GPUs often used to accelerate training.
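To make this concrete, here is a minimal NumPy sketch of batch gradient descent for a simple linear model trained with a mean-squared-error loss. The data, learning rate, and step count are illustrative assumptions, not recommendations; the key point is that every single parameter update requires a pass over all 10,000 examples.

```python
import numpy as np

# Toy data: 10,000 examples with 5 features (purely illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)        # model parameters
learning_rate = 0.1

for step in range(100):
    # Forward pass over the ENTIRE dataset.
    error = X @ w - y
    loss = np.mean(error ** 2)            # tracked only for monitoring

    # Gradient of the MSE loss with respect to w, again over all 10,000 examples.
    grad = 2 * X.T @ error / len(X)

    # Only after seeing every example do we make a single parameter update.
    w -= learning_rate * grad
```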
To overcome these limitations, we almost always process the data in smaller chunks, called batches or mini-batches.
Training with Batches
Instead of using the full dataset for each parameter update, mini-batch gradient descent processes a small, randomly selected subset of the training data at each step. Here's the breakdown of the terminology and process:
- Batch: A subset of the total training data. For example, if you have 10,000 training examples, a batch might consist of 32 or 64 examples.
- Batch Size: The number of training examples included in one batch. This is a hyperparameter you choose before training.
- Iteration: A single pass of processing one batch of data. This includes performing forward propagation, calculating the loss, performing backward propagation to get gradients, and updating the network's parameters based on that batch.
- Epoch: One complete pass through the entire training dataset. If your dataset has N examples and your batch size is B, one epoch consists of roughly N/B iterations (rounded up when N is not evenly divisible by B, since the final batch is smaller).
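As a quick check of these definitions, the following small snippet computes the number of iterations in one epoch; the example values of N and B are arbitrary.

```python
import math

num_examples = 10_000   # N: total number of training examples
batch_size = 64         # B: examples per batch (a hyperparameter)

# If N is not evenly divisible by B, the last batch is smaller,
# so one epoch takes the ceiling of N / B iterations.
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)  # 157 iterations per epoch for these numbers
```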
The typical training process using batches looks like this:
- Shuffle: At the beginning of each epoch, randomly shuffle the entire training dataset. This randomization is important to ensure that each batch is representative and that the model doesn't learn patterns based on the order of data presentation.
- Iterate: Loop through the shuffled dataset, taking one batch at a time.
- Process Batch: For each batch:
  - Perform the forward pass to get predictions.
  - Calculate the loss based on the batch's predictions and true labels.
  - Perform the backward pass (backpropagation) to compute the gradients of the loss with respect to the parameters, based only on the current batch.
  - Update the network's weights and biases using these calculated gradients and a learning rate (this is the core of mini-batch gradient descent).
- Repeat: Continue processing batches until the entire dataset has been seen (one epoch is complete).
- Multiple Epochs: Repeat the entire process for a set number of epochs, or until a stopping criterion is met (e.g., the loss on a separate validation set stops improving).
Figure: Training process within one epoch using mini-batches. The full dataset is shuffled, then processed one batch at a time; each iteration involves forward/backward passes and a parameter update based on that batch.
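As one possible way to express this loop in code, here is a minimal NumPy sketch that reuses the toy linear model from the earlier full-batch example; the batch size, learning rate, and epoch count are illustrative assumptions, not recommended settings.

```python
import numpy as np

# Same toy linear-regression setup as the earlier full-batch sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)         # model parameters
learning_rate = 0.1
batch_size = 64
num_epochs = 5

for epoch in range(num_epochs):
    # 1. Shuffle: a fresh random permutation of the example indices each epoch.
    indices = rng.permutation(len(X))

    # 2. Iterate: walk through the shuffled data one batch at a time.
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        X_batch, y_batch = X[batch_idx], y[batch_idx]

        # 3. Process batch: forward pass, loss, gradients, and update,
        #    all computed from this batch alone.
        error = X_batch @ w - y_batch
        loss = np.mean(error ** 2)                   # loss on this batch only
        grad = 2 * X_batch.T @ error / len(X_batch)  # gradient from this batch only
        w -= learning_rate * grad                    # mini-batch gradient descent update
```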
Choosing the Batch Size
The batch size (B) is a critical hyperparameter that influences training dynamics, computational efficiency, and model generalization. There's a trade-off (see the sketch after this list):
- Smaller Batch Sizes (e.g., 1, 8, 16, 32):
- Pros: Require less memory. Parameter updates happen more frequently (more iterations per epoch). The inherent noise in the gradient estimation (since it's based on few samples) can help the optimizer escape poor local minima and potentially find flatter minima, which often generalize better. The extreme case, a batch size of 1, is known as Stochastic Gradient Descent (SGD).
- Cons: The noisy gradients can cause the loss to fluctuate significantly, making convergence less stable. It can be computationally inefficient as modern hardware (especially GPUs) is optimized for parallel processing, which is underutilized with very small batches.
- Larger Batch Sizes (e.g., 128, 256, 512+):
- Pros: Provide a more accurate estimate of the true gradient over the whole dataset, leading to smoother convergence. Can leverage hardware parallelism more effectively, potentially speeding up computation per epoch.
- Cons: Require significantly more memory (RAM and GPU VRAM). Updates happen less frequently. Research suggests large batches can sometimes lead the optimizer towards "sharp" minima, which might generalize less well to unseen data compared to the "flatter" minima often found with smaller batches.
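One way to see this noise trade-off is to compare gradient estimates from batches of different sizes against the gradient computed over the full dataset. The sketch below does this for the same toy linear model used earlier; the batch sizes and the number of sampled batches are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)

def mse_gradient(X_sub, y_sub, w):
    """Gradient of the MSE loss of a linear model on the given examples."""
    error = X_sub @ w - y_sub
    return 2 * X_sub.T @ error / len(X_sub)

full_grad = mse_gradient(X, y, w)  # gradient over the whole dataset

for batch_size in (8, 64, 512):
    # Average distance between mini-batch gradient estimates and the full
    # gradient over many random batches: smaller batches give noisier estimates.
    deviations = []
    for _ in range(200):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        deviations.append(np.linalg.norm(mse_gradient(X[idx], y[idx], w) - full_grad))
    print(f"batch size {batch_size:4d}: mean deviation {np.mean(deviations):.3f}")
```

Because the mini-batch gradient is an average over B samples, its deviation from the full-dataset gradient shrinks roughly like 1/sqrt(B); this shrinking noise is exactly the trade-off described in the list above.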
Practical Considerations:
Commonly used batch sizes are powers of 2, such as 32, 64, 128, or 256. This is often due to hardware memory architectures and libraries being optimized for these sizes, leading to better computational efficiency. However, the optimal batch size depends heavily on the specific dataset, model architecture, and available hardware. It's usually determined through experimentation and hyperparameter tuning.
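In practice, deep learning frameworks handle the shuffling and batching for you. As an illustration, here is a minimal PyTorch sketch that wraps feature and label tensors in a DataLoader with a batch size of 64; the tensor shapes and epoch count are assumptions made purely for the example.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Illustrative tensors: 10,000 examples with 5 features each.
X = torch.randn(10_000, 5)
y = torch.randn(10_000)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # reshuffled every epoch

for epoch in range(5):
    for X_batch, y_batch in loader:
        # X_batch has shape (64, 5), except for a possibly smaller final batch.
        ...  # forward pass, loss, backward pass, optimizer step go here
```

Passing drop_last=True to the DataLoader would discard that smaller final batch if every batch must have exactly the chosen size.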
Larger batch sizes often produce a smoother loss decrease per iteration, but because parameters are updated less frequently, they may need more epochs to reach a comparable loss or may settle in less optimal minima. Smaller batch sizes lead to noisier loss curves but can sometimes navigate the loss landscape more effectively. Note that the number of iterations per epoch differs significantly between batch sizes.
In summary, batching is a standard and necessary technique for training neural networks efficiently on large datasets. It balances computational feasibility, memory constraints, and training dynamics. By processing data in batches, we enable frequent parameter updates using manageable chunks of data, forming the foundation of the iterative learning process explored in subsequent chapters on backpropagation and gradient descent variants.