As introduced earlier, the sheer scale of large language models places immense pressure not only on raw compute power but also on memory systems. Training models with tens or hundreds of billions of parameters requires careful consideration of both the capacity (how much data can be stored) and the bandwidth (how quickly data can be accessed) of the memory available on accelerator hardware such as GPUs and TPUs. Insufficient memory capacity or bandwidth can become a major bottleneck, severely limiting feasible model size and training efficiency.
The primary consumers of GPU memory (often called VRAM, Video Random Access Memory) during training are the model parameters, their gradients, the optimizer states, and the activations cached during the forward pass.
Consider a simplified calculation for parameter and optimizer memory. For a model with N parameters using 32-bit floating-point (FP32) precision (4 bytes per parameter), the memory for parameters alone is 4N bytes. If using an optimizer like AdamW, which stores two states per parameter (momentum and variance, typically also in FP32), the optimizer states require an additional 2×4N=8N bytes. Gradients add another 4N bytes. The total static memory for parameters, gradients, and AdamW states in FP32 is approximately 4N+8N+4N=16N bytes. For a 175 billion parameter model (like GPT-3), this alone is 16×175×10^9 bytes ≈ 2.8 terabytes, far exceeding the memory of any single current GPU. This calculation highlights why mixed-precision training (using 16-bit formats like FP16 or BF16) and distributed techniques are essential.
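As a quick check of that arithmetic, here is a minimal sketch using the 175B parameter count and FP32 byte sizes quoted above:
# Simplified check of the FP32 estimate above: 16N bytes of static state
num_params = 175e9                                  # 175 billion parameters
bytes_per_fp32 = 4

param_bytes = num_params * bytes_per_fp32           # parameters: 4N
grad_bytes = num_params * bytes_per_fp32            # gradients: 4N
optimizer_bytes = 2 * num_params * bytes_per_fp32   # AdamW momentum + variance: 8N

total_bytes = param_bytes + grad_bytes + optimizer_bytes  # 16N
print(f"FP32 static memory: {total_bytes / 1e12:.1f} TB")  # ~2.8 TB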
Using mixed precision (e.g., BF16, 2 bytes per parameter) significantly reduces this static footprint. Parameters might be stored in BF16 (2N bytes), gradients calculated in BF16 (2N bytes), and optimizer states potentially kept in FP32 for stability (8N bytes), leading to approximately 2N+2N+8N=12N bytes. Even with this reduction, the memory requirement remains substantial.
# Example calculation (simplified): static memory with BF16 params/grads
# and FP32 AdamW optimizer states
num_params_billion = 175
bytes_per_bf16 = 2
bytes_per_fp32 = 4

# Parameter size in bytes (parameters stored in BF16)
param_memory_bf16 = num_params_billion * 1e9 * bytes_per_bf16
# Gradient size (often kept in the same precision as the parameters)
grad_memory_bf16 = num_params_billion * 1e9 * bytes_per_bf16
# Optimizer state size (AdamW stores two FP32 states per parameter)
optimizer_memory_fp32 = 2 * num_params_billion * 1e9 * bytes_per_fp32

total_static_memory_mixed = (
    param_memory_bf16
    + grad_memory_bf16
    + optimizer_memory_fp32
)
total_static_memory_tb = total_static_memory_mixed / 1e12  # bytes -> terabytes

print(
    f"Approx. static memory (BF16 params/grads, FP32 AdamW states) for "
    f"{num_params_billion}B params: {total_static_memory_tb:.2f} TB"
)
# Note: This excludes activations, which are highly variable.
# It also simplifies mixed-precision details (e.g., FP32 master weights).
Activation memory usage is often the most challenging component. In a Transformer, the total size of the activations cached across layers is roughly proportional to batch_size × sequence_length × hidden_size × num_layers. Because sequence lengths (L) in LLMs can reach thousands of tokens and batch sizes (B) are scaled up for efficiency, this O(B · L · d_model · N_layers) memory quickly dominates, especially with standard backpropagation, which requires caching activations from the forward pass. Techniques like activation checkpointing (also called gradient checkpointing) mitigate this by recomputing activations during the backward pass instead of storing them, trading increased computation time for reduced memory usage.
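The snippet below is a minimal sketch of activation checkpointing in PyTorch using torch.utils.checkpoint; the toy block, layer count, and tensor shapes are illustrative and not tied to any particular model.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyBlock(nn.Module):
    """Simplified Transformer-style feed-forward block with a residual connection."""
    def __init__(self, hidden_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        return x + self.net(x)

hidden_size, num_layers = 1024, 8
blocks = nn.ModuleList([ToyBlock(hidden_size) for _ in range(num_layers)])
x = torch.randn(4, 512, hidden_size, requires_grad=True)  # (batch, seq_len, hidden)

# With checkpointing, each block's intermediate activations are discarded after the
# forward pass and recomputed during backward, shrinking peak activation memory.
out = x
for block in blocks:
    out = checkpoint(block, out, use_reentrant=False)
out.sum().backward()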
Given the massive data requirements, simply having enough GPU RAM capacity is insufficient; the memory must also be fast enough to feed the hungry compute cores of the GPU. This is where High Bandwidth Memory (HBM) becomes significant.
HBM is a type of stacked DRAM technology designed to be placed very close to the GPU compute die using a silicon interposer. This proximity and a very wide communication bus (e.g., 1024 bits or more, compared to 256/384 bits for typical GDDR memory) allow for significantly higher memory bandwidth than traditional GDDR memory found on consumer graphics cards.
Modern compute GPUs used for large-scale training (like NVIDIA's A100 or H100 series) rely exclusively on HBM (specifically HBM2e or HBM3). The high bandwidth, often reaching 1.5-3 TB/s per GPU, is essential: many operations in a training step, such as optimizer updates and other elementwise or normalization steps, move far more bytes than they perform arithmetic on, so insufficient bandwidth leaves the compute units idle waiting for data.
Approximate peak memory bandwidth for different GPU memory technologies. HBM provides substantially higher bandwidth compared to GDDR, which is critical for LLM training performance.
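As a rough illustration of why this bandwidth matters, the sketch below estimates a lower bound on the time to stream a model's BF16 weights from HBM once; the 175B parameter count and the 3 TB/s figure are assumed round numbers, and a model this large would in practice be sharded across many GPUs.
# Rough lower bound: time to read every BF16 weight from HBM once
num_params = 175e9                  # assumed parameter count
bytes_per_bf16 = 2
hbm_bandwidth_bytes_per_s = 3e12    # ~3 TB/s, HBM3-class GPU (assumed)

weight_bytes = num_params * bytes_per_bf16
read_time_s = weight_bytes / hbm_bandwidth_bytes_per_s
print(f"Reading {weight_bytes / 1e9:.0f} GB of weights takes at least {read_time_s * 1e3:.0f} ms")
# 350 GB at 3 TB/s -> roughly 117 ms, before gradients, optimizer states, or activations move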
While HBM provides superior bandwidth, its manufacturing complexity generally leads to higher costs and potentially lower capacities per dollar compared to GDDR. Therefore, selecting hardware involves balancing the need for high bandwidth (favoring HBM-equipped GPUs) with the total capacity required and budget constraints. The memory capacity per GPU (e.g., 40GB, 80GB, or more) directly dictates how large a model (or model shard, in distributed training) can fit, influencing the choice of parallelism strategies (Data Parallelism, Tensor Parallelism, Pipeline Parallelism, ZeRO) needed to accommodate the model and its associated states during training. Understanding these memory limitations is fundamental to designing efficient and feasible training setups for large language models.
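For example, here is a back-of-the-envelope sketch of the minimum GPU count needed just to hold the static training state, assuming the 12N-byte mixed-precision figure from earlier, 80 GB per GPU, and perfectly even ZeRO-3-style sharding of parameters, gradients, and optimizer states, with no headroom left for activations or buffers:
import math

num_params = 175e9
static_bytes = 12 * num_params      # BF16 params/grads + FP32 AdamW states (12N)
gpu_capacity_bytes = 80e9           # 80 GB per GPU (e.g., an 80 GB A100 or H100)

min_gpus = math.ceil(static_bytes / gpu_capacity_bytes)
print(f"Minimum GPUs just to hold static state: {min_gpus}")
# 2.1e12 / 80e9 = 26.25 -> at least 27 GPUs before activation memory is even considered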