Building models with billions, or even trillions, of parameters using datasets measured in terabytes presents formidable computational hurdles. As discussed previously, the scale of these models is directly linked to their capabilities, but this scale comes at a significant cost in terms of computational resources. Understanding these challenges is fundamental to appreciating the engineering required for LLM development. The primary constraints revolve around two key resources: memory and compute.
Training a large neural network requires storing several components in the memory of the hardware accelerators (typically GPUs or TPUs). For LLMs, the sheer size of these components often exceeds the memory capacity of a single device. Let's break down the main memory consumers:
Model Parameters: These are the weights and biases (W) that the model learns during training. For large models, this is often the most obvious memory requirement. If a model has N parameters and uses 32-bit floating-point precision (FP32), it requires N×4 bytes just to store the weights. A 100-billion parameter model, for instance, needs approximately 100×10⁹×4 bytes ≈ 400 GB for its parameters alone. Even using 16-bit precision (FP16 or BF16), this still amounts to 200 GB, far exceeding the memory of typical individual accelerators (which might range from 16 GB to 80 GB).
Gradients: During backpropagation, we compute the gradient of the loss function with respect to each parameter (∇W). These gradients usually have the same dimensions and data type as the parameters themselves. Therefore, storing gradients requires another N×4 bytes (for FP32) or N×2 bytes (for FP16/BF16).
Optimizer States: Modern optimizers like Adam or AdamW maintain state information for each parameter to adapt the learning rate. Adam, for example, stores estimates of the first moment (momentum) and the second moment (variance) for each parameter. If these moments are stored in FP32, they require an additional N×4 bytes each. This means the optimizer states alone can easily double or triple the memory footprint relative to the parameters (1×N values for parameters, 1×N for gradients, 2×N for Adam states, i.e., 4×N values or roughly 16N bytes in FP32); a worked calculation after this list puts concrete numbers on these components.
Activations: Perhaps the largest memory consumers, especially with large batch sizes and long sequences, are the intermediate activations produced during the forward pass. Every activation computed for a layer often needs to be stored until it is used during the backward pass to compute gradients. The memory required for activations (A) scales roughly with the batch size (b), sequence length (s), hidden dimension (h), and the number of layers (L). For Transformers specifically, the attention mechanism's intermediate results (such as the attention scores matrix, which scales quadratically with sequence length, O(s²)) can be particularly memory-intensive. This footprint can significantly outweigh the memory needed for the parameters themselves during training. Techniques like activation checkpointing (or gradient checkpointing) are often employed to trade compute for memory by recomputing activations during the backward pass instead of storing them all, as sketched below.
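To illustrate the idea, here is a minimal PyTorch sketch of activation checkpointing using torch.utils.checkpoint. The toy model, layer sizes, and batch shape are illustrative assumptions, not a recipe for a production Transformer.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLPStack(nn.Module):
    """Toy stack of feed-forward blocks whose inner activations are
    recomputed during the backward pass instead of being stored."""

    def __init__(self, hidden: int = 1024, num_layers: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, 4 * hidden),
                nn.GELU(),
                nn.Linear(4 * hidden, hidden),
            )
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Only the block's input is kept; the intermediate activations
            # inside the block are recomputed when gradients are needed.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLPStack()
out = model(torch.randn(4, 128, 1024))   # (batch, sequence, hidden)
out.mean().backward()                    # triggers recomputation of each block
```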
The combination of these components dictates the total memory requirement per accelerator.
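A short back-of-the-envelope calculation makes the parameter, gradient, and optimizer-state figures above concrete. This is a rough sketch that assumes full FP32 training with Adam and a hypothetical 100-billion-parameter model; it deliberately omits activations, whose size depends on batch size and sequence length.

```python
def training_state_memory_gb(num_params: float, bytes_per_value: int = 4) -> dict:
    """Approximate per-component memory in GB for FP32 training with Adam."""
    gb = num_params * bytes_per_value / 1e9
    return {
        "parameters": gb,       # 1 x N values
        "gradients": gb,        # 1 x N values
        "adam_momentum": gb,    # first-moment estimate, 1 x N values
        "adam_variance": gb,    # second-moment estimate, 1 x N values
    }

breakdown = training_state_memory_gb(100e9)  # hypothetical 100B-parameter model
for name, gigabytes in breakdown.items():
    print(f"{name:>14}: {gigabytes:,.0f} GB")
print(f"{'total':>14}: {sum(breakdown.values()):,.0f} GB")  # ~1,600 GB before activations
```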
Breakdown of typical memory consumers within a single accelerator during LLM training. Activations and optimizer states often require substantial memory beyond just the model parameters.
Because the total required memory often exceeds what's available on a single device, distributed training strategies that split the model and its associated states across multiple accelerators become necessary.
Training LLMs is computationally intensive, requiring vast numbers of floating-point operations (FLOPs).
Training FLOPs: The bulk of the computation occurs in matrix multiplications within the Transformer's feed-forward networks and self-attention mechanisms. Empirical studies, often referred to as scaling laws (e.g., Kaplan et al., 2020; Hoffmann et al., 2022), provide estimates for the computational cost. A common approximation is that training a model with N parameters on a dataset of D tokens requires roughly C ≈ 6×N×D FLOPs. For models with hundreds of billions of parameters trained on trillions of tokens, this results in compute budgets in the zettaFLOP range (10²¹ FLOPs) or higher; the rough estimate after this list illustrates the magnitudes. Completing such computations requires large clusters of accelerators running for weeks or months.
Inference FLOPs: While generating a single token requires far less computation than training, autoregressive generation (producing tokens one after another) still demands significant compute. For a model with N parameters, generating one token takes approximately 2×N FLOPs (ignoring positional encodings, layer norms, etc.). Generating long sequences token by token therefore accumulates substantial compute and can become latency-sensitive. Furthermore, the self-attention mechanism has a computational complexity of O(s²·h) per layer for a sequence of length s and hidden dimension h. While techniques like Key-Value (KV) caching avoid redundant computation for past tokens during generation, processing the initial prompt still incurs this quadratic cost relative to the prompt length.
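The following sketch turns these approximations into numbers. The model size, token count, GPU count, and per-GPU throughput are illustrative assumptions chosen only to show the orders of magnitude involved, not specifications of any particular system.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Total training compute from the C ≈ 6·N·D approximation."""
    return 6.0 * num_params * num_tokens

def decode_flops_per_token(num_params: float) -> float:
    """Approximate cost of generating one token: ~2·N FLOPs."""
    return 2.0 * num_params

# Hypothetical run: 100B parameters trained on 2T tokens.
c = training_flops(100e9, 2e12)
print(f"training compute: {c:.1e} FLOPs")           # ~1.2e24, above a zettaFLOP

# Rough wall-clock time on an assumed cluster.
num_gpus = 1024
sustained_flops_per_gpu = 150e12                     # assumed effective FLOP/s per GPU
days = c / (num_gpus * sustained_flops_per_gpu) / 86400
print(f"estimated duration: {days:.0f} days")        # on the order of ~90 days

# Per-token decode cost for the same model.
print(f"decode cost: {decode_flops_per_token(100e9):.1e} FLOPs per token")  # ~2e11
```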
As memory and compute demands force training across multiple accelerators (often hundreds or thousands), communication between these devices becomes a critical factor.
The efficiency of the interconnects (like NVLink between GPUs or InfiniBand/Ethernet between nodes) significantly impacts overall training throughput. Slow communication can lead to accelerators sitting idle while waiting for data, creating bottlenecks that limit scaling.
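To see why interconnect bandwidth matters, consider the gradient traffic a plain data-parallel step must move. The sketch below uses the standard ring all-reduce traffic estimate; the model size, GPU count, and bandwidth figure are illustrative assumptions rather than measurements of any particular cluster.

```python
def ring_allreduce_seconds(num_params: float, bytes_per_grad: int,
                           num_gpus: int, bandwidth_gb_per_s: float) -> float:
    """Approximate communication time to all-reduce the gradients once.
    A ring all-reduce sends and receives about 2·(n-1)/n of the payload per GPU."""
    payload_gb = num_params * bytes_per_grad / 1e9
    traffic_gb = 2.0 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / bandwidth_gb_per_s

# Hypothetical: 100B FP16 gradients, 1024 GPUs, 100 GB/s effective per-GPU bandwidth.
t = ring_allreduce_seconds(100e9, 2, 1024, 100.0)
print(f"~{t:.1f} s of pure gradient communication per step if not overlapped")
```

In practice, frameworks overlap this communication with the backward pass, but the estimate shows how quickly a slow interconnect can come to dominate step time.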
These computational challenges—memory capacity limits, massive FLOP requirements, and communication overhead—drive the need for the specialized hardware, software frameworks (like PyTorch with distributed support, DeepSpeed, Megatron-LM), and parallelization strategies that we will explore in detail throughout this course. Overcoming these hurdles is central to the engineering practice of building large language models.