As we saw, standard (or batch) gradient descent computes the gradient using the entire dataset for each parameter update. This yields the exact gradient of the training loss, but it becomes computationally expensive and slow for the large datasets common in deep learning. Stochastic Gradient Descent (SGD) goes to the other extreme, updating the parameters after evaluating the gradient on just a single training example. This is much faster per update and can help escape shallow local minima, but the updates are very noisy, leading to an erratic convergence path.
Is there a middle ground? Yes, and it's the most common approach used in practice: Mini-batch Gradient Descent.
Mini-batch Gradient Descent strikes a balance between the reliability of Batch GD and the efficiency of SGD. Instead of using the entire dataset or just one sample, it computes the gradient and updates the parameters based on a small subset of the training data, called a mini-batch.
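Written out, one standard form of the update rule is the following, where $\theta$ denotes the parameters, $\eta$ the learning rate, $\mathcal{B}$ the current mini-batch, and $L_i$ the loss on the $i$-th example:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$$

Batch GD corresponds to $\mathcal{B}$ being the entire dataset, and SGD to $|\mathcal{B}| = 1$.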
The core idea is straightforward:

1. Shuffle the training dataset at the start of each epoch.
2. Partition it into small mini-batches (typical sizes are 32, 64, 128, or 256 examples).
3. For each mini-batch, compute the gradient of the loss averaged over its examples and perform one parameter update.

An epoch is completed once the algorithm has processed all the mini-batches covering the entire training dataset. Training typically involves many epochs.
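To make this loop concrete, here is a minimal, self-contained sketch of mini-batch gradient descent on a simple linear least-squares problem. The synthetic data, model, and hyperparameter values are illustrative choices, not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1,000 samples, 5 features
true_w = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)  # noisy linear targets

w = np.zeros(5)                                    # parameters to learn
lr, batch_size, num_epochs = 0.1, 64, 20

for epoch in range(num_epochs):
    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        # Gradient of the mean squared error on this mini-batch only
        grad = 2 * X_b.T @ (X_b @ w - y_b) / len(batch)
        w -= lr * grad                             # one update per mini-batch
```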
This approach offers several significant advantages:

- Computational efficiency: a mini-batch is large enough to exploit vectorized operations on modern hardware such as GPUs, yet small enough to fit comfortably in memory.
- Smoother convergence than SGD: averaging the gradient over a mini-batch reduces the variance of the estimate, so the optimization path is less erratic.
- More frequent updates than Batch GD: the parameters are updated many times per epoch instead of once, which usually speeds up learning on large datasets.
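The efficiency advantage is easy to see empirically: one vectorized pass over a batch is much faster than processing its examples one by one. The following illustrative timing sketch (matrix sizes are arbitrary and timings are machine-dependent) makes the point:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))       # a stand-in for a layer's weights
batch = rng.normal(size=(64, 512))    # a mini-batch of 64 examples

start = time.perf_counter()
for _ in range(100):
    _ = batch @ W.T                   # one vectorized pass over the batch
vectorized = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    for row in batch:
        _ = row @ W.T                 # one example at a time
looped = time.perf_counter() - start

print(f"vectorized: {vectorized:.4f}s, per-example loop: {looped:.4f}s")
```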
The following visualization provides a conceptual comparison of the optimization paths these different gradient descent variants might take on a hypothetical loss surface.
Optimization paths for Batch GD (smooth, direct), SGD (noisy), and Mini-batch GD (intermediate noise, more frequent updates) towards a minimum (not shown).
While Mini-batch GD is the workhorse of deep learning optimization, there are a couple of points to keep in mind:

- The batch size is a hyperparameter. Smaller batches mean noisier gradients but more updates per epoch; larger batches mean smoother gradients but fewer updates and higher memory use (see the short sketch after this list).
- The mini-batch gradient is still only an estimate of the full-dataset gradient, so the loss typically fluctuates from update to update rather than decreasing monotonically.
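As a rough illustration of the batch-size trade-off, this snippet (using a hypothetical dataset size) counts how many parameter updates a single epoch performs at different batch sizes:

```python
import math

n_samples = 50_000  # hypothetical training set size
for batch_size in (1, 32, 256, n_samples):
    updates = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size:>6}: {updates} updates per epoch")
```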
In practice, Mini-batch Gradient Descent (often simply referred to as SGD in many deep learning library contexts, even though it uses mini-batches) forms the basis for most optimization algorithms used today. Deep learning frameworks like PyTorch and TensorFlow provide convenient DataLoader utilities that automatically handle shuffling and creating mini-batches for you during training.
```python
# Conceptual PyTorch training loop using mini-batches
import torch
from torch.utils.data import TensorDataset, DataLoader

# Assume X_train, y_train are tensors
# Assume model and criterion (loss function) are defined
# Assume optimizer is initialized (e.g., torch.optim.SGD)

batch_size = 64
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        # inputs and labels form a mini-batch
        optimizer.zero_grad()              # Zero the parameter gradients
        outputs = model(inputs)            # Forward pass
        loss = criterion(outputs, labels)
        loss.backward()                    # Backward pass
        optimizer.step()                   # Update parameters
        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss / len(train_loader)}")

# After training, evaluate on a validation set
```
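Following up on that last comment, a minimal validation pass might look like the sketch below; the X_val and y_val tensors are assumed to exist alongside the objects defined above:

```python
# Conceptual validation loop (X_val and y_val are assumed to exist)
val_dataset = TensorDataset(X_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size)  # no shuffling needed

model.eval()                   # Set model to evaluation mode
val_loss = 0.0
with torch.no_grad():          # Disable gradient tracking during evaluation
    for inputs, labels in val_loader:
        outputs = model(inputs)
        val_loss += criterion(outputs, labels).item()

print(f"Validation Loss: {val_loss / len(val_loader)}")
```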
This combination of efficiency and reasonably accurate gradient estimation makes Mini-batch GD a highly effective and widely adopted method for training deep neural networks. However, the basic mini-batch approach still faces challenges, such as navigating ravines or saddle points in the loss landscape, which motivates the development of more advanced techniques like Momentum and adaptive methods.