As we saw, standard (or batch) gradient descent computes the gradient using the entire dataset for each parameter update. This yields the exact gradient of the training loss, but it becomes computationally expensive and slow for the large datasets common in deep learning. Stochastic Gradient Descent (SGD) goes to the other extreme, updating the parameters after evaluating the gradient on just a single training example. This is much faster per update and can help escape shallow local minima, but the updates are very noisy, leading to an erratic convergence path.
Is there a middle ground? Yes, and it's the most common approach used in practice: Mini-batch Gradient Descent.
Mini-batch Gradient Descent strikes a balance between the reliability of Batch GD and the efficiency of SGD. Instead of using the entire dataset or just one sample, it computes the gradient and updates the parameters based on a small subset of the training data, called a mini-batch.
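Written out, one standard form of the update rule is the following, where $\theta$ denotes the parameters, $\eta$ the learning rate, $\mathcal{B}$ the current mini-batch, and $L_i$ the loss on the $i$-th example:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$$

Batch GD corresponds to $\mathcal{B}$ being the entire dataset, and SGD to $|\mathcal{B}| = 1$.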
The core idea is straightforward:

1. Shuffle the training dataset at the start of each epoch.
2. Partition it into small mini-batches (typical sizes are 32, 64, 128, or 256 examples).
3. For each mini-batch, compute the gradient of the loss averaged over its examples and perform one parameter update.

An epoch is completed once the algorithm has processed all the mini-batches covering the entire training dataset. Training typically involves many epochs.
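To make this loop concrete, here is a minimal, self-contained sketch of mini-batch gradient descent on a simple linear least-squares problem. The synthetic data, model, and hyperparameter values are illustrative choices, not part of the original example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1,000 samples, 5 features
true_w = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)  # noisy linear targets

w = np.zeros(5)                                    # parameters to learn
lr, batch_size, num_epochs = 0.1, 64, 20

for epoch in range(num_epochs):
    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        # Gradient of the mean squared error on this mini-batch only
        grad = 2 * X_b.T @ (X_b @ w - y_b) / len(batch)
        w -= lr * grad                             # one update per mini-batch
```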
This approach offers several significant advantages:

- Computational efficiency: a mini-batch is large enough to exploit vectorized operations on modern hardware such as GPUs, yet small enough to fit comfortably in memory.
- Smoother convergence than SGD: averaging the gradient over a mini-batch reduces the variance of the estimate, so the optimization path is less erratic.
- More frequent updates than Batch GD: the parameters are updated many times per epoch instead of once, which usually speeds up learning on large datasets.
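The efficiency advantage is easy to see empirically: one vectorized pass over a batch is much faster than processing its examples one by one. The following illustrative timing sketch (matrix sizes are arbitrary and timings are machine-dependent) makes the point:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))       # a stand-in for a layer's weights
batch = rng.normal(size=(64, 512))    # a mini-batch of 64 examples

start = time.perf_counter()
for _ in range(100):
    _ = batch @ W.T                   # one vectorized pass over the batch
vectorized = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    for row in batch:
        _ = row @ W.T                 # one example at a time
looped = time.perf_counter() - start

print(f"vectorized: {vectorized:.4f}s, per-example loop: {looped:.4f}s")
```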
The following visualization provides a conceptual comparison of the optimization paths these different gradient descent variants might take on a hypothetical loss surface.
Optimization paths for Batch GD (smooth, direct), SGD (noisy), and Mini-batch GD (intermediate noise, more frequent updates) towards a minimum (not shown).
While Mini-batch GD is the workhorse of deep learning optimization, there are a couple of points to keep in mind:

- The batch size is a hyperparameter. Smaller batches mean noisier gradients but more updates per epoch; larger batches mean smoother gradients but fewer updates and higher memory use (see the short sketch after this list).
- The mini-batch gradient is still only an estimate of the full-dataset gradient, so the loss typically fluctuates from update to update rather than decreasing monotonically.
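As a rough illustration of the batch-size trade-off, this snippet (using a hypothetical dataset size) counts how many parameter updates a single epoch performs at different batch sizes:

```python
import math

n_samples = 50_000  # hypothetical training set size
for batch_size in (1, 32, 256, n_samples):
    updates = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size:>6}: {updates} updates per epoch")
```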
In practice, Mini-batch Gradient Descent (often simply referred to as SGD in many deep learning library contexts, even though it uses mini-batches) forms the basis for most optimization algorithms used today. Deep learning frameworks like PyTorch and TensorFlow provide convenient DataLoader utilities that automatically handle shuffling and creating mini-batches for you during training.
```python
# Conceptual PyTorch training loop using mini-batches
import torch
from torch.utils.data import TensorDataset, DataLoader

# Assume X_train, y_train are tensors
# Assume model and criterion (loss function) are defined
# Assume optimizer is initialized (e.g., torch.optim.SGD)

batch_size = 64
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        # inputs and labels form a mini-batch
        optimizer.zero_grad()              # Zero the parameter gradients
        outputs = model(inputs)            # Forward pass
        loss = criterion(outputs, labels)
        loss.backward()                    # Backward pass
        optimizer.step()                   # Update parameters
        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss / len(train_loader)}")

# After training, evaluate on a validation set
```
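Following up on that last comment, a minimal validation pass might look like the sketch below; the X_val and y_val tensors are assumed to exist alongside the objects defined above:

```python
# Conceptual validation loop (X_val and y_val are assumed to exist)
val_dataset = TensorDataset(X_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size)  # no shuffling needed

model.eval()                   # Set model to evaluation mode
val_loss = 0.0
with torch.no_grad():          # Disable gradient tracking during evaluation
    for inputs, labels in val_loader:
        outputs = model(inputs)
        val_loss += criterion(outputs, labels).item()

print(f"Validation Loss: {val_loss / len(val_loader)}")
```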
This combination of efficiency and reasonably accurate gradient estimation makes Mini-batch GD a highly effective and widely adopted method for training deep neural networks. However, the basic mini-batch approach still faces challenges, such as navigating ravines or saddle points in the loss landscape, which motivates the development of more advanced techniques like Momentum and adaptive methods.