Training contemporary large language models, often containing tens or hundreds of billions of parameters, presents significant engineering hurdles that simply cannot be overcome using a single accelerator device like a GPU or TPU. While smaller deep learning models might fit comfortably within the memory of one GPU, LLMs operate at a scale where this is no longer feasible. Attempting to load and train such massive models on a single device quickly runs into fundamental hardware limitations related to both memory capacity and computational throughput.
Let's break down why distributing the training process becomes an absolute necessity.
The most immediate challenge is memory. A single high-end accelerator, like an NVIDIA A100 or H100 GPU, typically offers around 40GB or 80GB of High Bandwidth Memory (HBM). This seems substantial, but the memory footprint of an LLM during training comprises several large components:
Model Parameters: These are the weights and biases the model learns. Storing them requires significant space, especially when using standard 32-bit floating-point precision (FP32). Even with mixed-precision training using 16-bit formats like FP16 or BFloat16 (BF16), the parameter storage alone can exceed single-GPU memory. For instance, a model with 70 billion parameters stored in BF16 (2 bytes per parameter) requires 140 GB, already far exceeding an 80GB GPU.
# Approximate memory for parameters (in GB)
num_parameters = 70_000_000_000
bytes_per_parameter_mixed_precision = 2  # Using BF16 or FP16

# Use decimal gigabytes (1 GB = 1e9 bytes) to match the figures quoted in the text
param_memory_gb = (num_parameters * bytes_per_parameter_mixed_precision) / 1e9
print(f"Approx. parameter memory (BF16/FP16): {param_memory_gb:.2f} GB")
# Output: Approx. parameter memory (BF16/FP16): 140.00 GB
# Note: actual usage is slightly higher due to alignment and framework overhead.
Gradients: During backpropagation, gradients are computed for each parameter. These gradients typically have the same dimensions and require the same precision as the parameters themselves. So, for our 70B parameter example using mixed precision, we need another 140 GB just to store the gradients.
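A minimal sketch of this calculation, reusing the values from the parameter snippet above: because the gradients share the parameters' shape and (in mixed precision) their dtype, the estimate is identical.
# Gradients match the parameters in shape and dtype, so their footprint
# mirrors the parameter estimate above.
grad_memory_gb = (num_parameters * bytes_per_parameter_mixed_precision) / 1e9
print(f"Approx. gradient memory (BF16/FP16): {grad_memory_gb:.2f} GB")
# Output: Approx. gradient memory (BF16/FP16): 140.00 GB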
Optimizer States: Modern optimizers like Adam or AdamW maintain internal states to adapt the learning rate for each parameter. AdamW, commonly used for training LLMs, stores two states per parameter: the first moment (momentum) and the second moment (variance). These states are often kept in higher precision (FP32, 4 bytes) for stability, even during mixed-precision training. This means an additional 70 billion × 2 states × 4 bytes/state = 560 GB for the optimizer states.
# Approximate memory for AdamW optimizer states (in GB)
# Typically stored in FP32 (4 bytes) for numerical stability
bytes_per_state_fp32 = 4
num_states_per_param_adamw = 2  # first moment (momentum) and second moment (variance)

optimizer_memory_gb = (num_parameters * num_states_per_param_adamw *
                       bytes_per_state_fp32) / 1e9
print(f"Approx. AdamW optimizer state memory (FP32): {optimizer_memory_gb:.2f} GB")
# Output: Approx. AdamW optimizer state memory (FP32): 560.00 GB
Activations: During the forward pass, the intermediate outputs (activations) of each layer must be stored for use in the backward pass gradient calculation. The size of these activations depends on the batch size, sequence length, and model hidden dimensions. For Transformers, the self-attention mechanism is particularly memory-intensive: the naive implementation materializes attention score matrices requiring O(batch_size × num_heads × sequence_length²) memory per layer. While techniques like activation checkpointing (recomputing activations during the backward pass instead of storing them) can reduce this footprint, significant activation memory is still needed, often tens or hundreds of gigabytes depending on the configuration.
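To get a feel for the scale, here is a rough sketch of the attention score memory for a single layer. The batch size, sequence length, and head count below are purely illustrative assumptions, not the configuration of any specific model, and the real total depends on precision, implementation, and whether checkpointing or fused attention kernels are used.
# Rough estimate of naive attention score memory for ONE layer
batch_size = 8            # assumed for illustration
sequence_length = 4096    # assumed for illustration
num_heads = 64            # assumed for illustration
bytes_per_value = 2       # BF16/FP16

# Naive attention stores one (seq_len x seq_len) score matrix per head, per sample
attn_scores_bytes = batch_size * num_heads * sequence_length**2 * bytes_per_value
print(f"Attention score memory per layer: {attn_scores_bytes / 1e9:.2f} GB")
# Output: Attention score memory per layer: 17.18 GB
Multiplied across dozens of layers, this term alone quickly reaches hundreds of gigabytes, which is why checkpointing and memory-efficient attention implementations matter so much in practice.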
Summing just the parameters, gradients, and optimizer states for our 70B parameter example gives 140+140+560=840 GB. This is over ten times the capacity of a single 80GB GPU, and we haven't even fully accounted for activations or workspace memory required by libraries like cuDNN.
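Putting the pieces together, a quick back-of-the-envelope comparison against a single 80GB accelerator (reusing the quantities computed in the snippets above) looks like this:
# Total model-state memory: parameters + gradients + optimizer states
total_state_gb = param_memory_gb + grad_memory_gb + optimizer_memory_gb
gpu_capacity_gb = 80  # e.g., an A100/H100 with 80GB of HBM
print(f"Total model-state memory: {total_state_gb:.0f} GB")
print(f"Minimum GPUs just to hold the model state: {total_state_gb / gpu_capacity_gb:.1f}")
# Output: Total model-state memory: 840 GB
# Output: Minimum GPUs just to hold the model state: 10.5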
Estimated memory breakdown for training a ~70 billion parameter model compared to the typical 80GB HBM of a high-end GPU. Activations and workspace memory add further requirements.
Clearly, the memory requirements force us to look beyond a single device.
Beyond memory, the sheer computational cost (measured in floating-point operations, or FLOPs) required to train an LLM is enormous. A single forward and backward pass through a multi-billion parameter model involves trillions of floating-point operations, and training repeats this over datasets containing trillions of tokens, pushing the total compute budget far beyond what one accelerator can deliver in any reasonable amount of time.
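A widely used rule of thumb from scaling-law analyses (Kaplan et al., 2020) puts training compute at roughly 6 FLOPs per parameter per training token. The sketch below applies this approximation to our 70B parameter example; the 2 trillion token dataset and the 500 TFLOP/s sustained throughput are illustrative assumptions, not figures for any particular model or device.
# Rough training compute estimate using the ~6 * N * D FLOPs rule of thumb
num_parameters = 70_000_000_000
num_training_tokens = 2_000_000_000_000   # assumed for illustration

total_flops = 6 * num_parameters * num_training_tokens
print(f"Approx. total training compute: {total_flops:.2e} FLOPs")
# Output: Approx. total training compute: 8.40e+23 FLOPs

# Even at an assumed sustained 500 TFLOP/s on one accelerator:
sustained_flops_per_second = 500e12
seconds = total_flops / sustained_flops_per_second
print(f"Time on a single device: {seconds / (3600 * 24 * 365):.1f} years")
# Output: Time on a single device: 53.3 years
Even granting generous hardware utilization, a single device would need decades to finish such a run, which is why the compute must be spread across many accelerators working in parallel.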
Faced with these memory and computational walls, the only viable approach is to distribute the training workload across a cluster of multiple interconnected accelerators. By dividing the model's parameters, data, or computational graph across many devices working in parallel, we can hold model states that far exceed any single device's memory and bring training time down from decades to weeks or months.
The following sections explore the primary strategies developed to achieve this distribution: Data Parallelism, Tensor Parallelism, and Pipeline Parallelism, along with their combinations and associated communication considerations. Understanding these techniques is essential for anyone involved in building and training large-scale language models.