The enormous capabilities of modern large language models are intrinsically linked to their scale. Understanding how model performance relates to size, data, and computational effort is fundamental to tackling efficiency challenges. This relationship is often captured by empirical observations known as scaling laws.
Pioneering work, notably by Kaplan et al. (2020) and later refined by Hoffmann et al. (2022), established that LLM performance (typically measured by cross-entropy loss on a held-out dataset) often follows a power-law relationship with respect to three primary factors:

- Model size (N): the number of trainable parameters.
- Dataset size (D): the number of training tokens.
- Compute (C): the total training compute, typically measured in FLOPs.
The core observation is that the test loss L scales as a power law with each of N, D, and C individually, provided the other factors are not the bottleneck. For instance, when compute and data are abundant, the loss often scales with model size as:
$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Here, $N_c$ and $\alpha_N$ are constants specific to the model architecture and training setup. Similar power laws exist for $L(D)$ and $L(C)$. These relationships imply that increasing model size, data, or compute yields diminishing but predictable returns in performance (lower loss).
Illustrative power-law relationship between test loss and model parameters/compute on a log-log scale. Real-world plots show similar trends, although noise and specific architectural choices can cause deviations.
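As a concrete illustration, the short sketch below evaluates the power law $L(N) = (N_c/N)^{\alpha_N}$ for a few model sizes. The constants are illustrative placeholders (in the general range reported by Kaplan et al. for their setup), not values fitted to any particular model.

```python
# Illustrative sketch: evaluate the loss-vs-parameters power law L(N) = (N_c / N)^alpha_N.
# The constants below are placeholders in the general range reported by Kaplan et al. (2020);
# they are not fitted to any particular model or dataset.

N_C = 8.8e13      # assumed scale constant (illustrative)
ALPHA_N = 0.076   # assumed exponent (illustrative)

def loss_from_params(n_params: float, n_c: float = N_C, alpha_n: float = ALPHA_N) -> float:
    """Predicted test loss for an n_params-parameter model when data and compute are abundant."""
    return (n_c / n_params) ** alpha_n

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"N = {n:.0e} params -> predicted loss ~ {loss_from_params(n):.3f}")
```

Plotting these values on a log-log scale produces the straight-line trend described above; the slope corresponds to the exponent $\alpha_N$.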
An important refinement came with the "Chinchilla" scaling laws (Hoffmann et al., 2022). Earlier work often trained very large models on datasets that were relatively small for the model size. The Chinchilla paper demonstrated that for a given compute budget C, the optimal allocation involves scaling both model size N and dataset size D roughly in proportion to each other. Specifically, they found that for optimal performance under a fixed FLOP budget, models should be trained on significantly more tokens than was previously common. A 70B parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed the much larger 175B parameter GPT-3 (trained on 300 billion tokens) and the 280B parameter Gopher (also trained on 300 billion tokens) on many benchmarks.
This implies that achieving the best performance for a given computational budget requires a balanced increase in both model parameters and training data, rather than prioritizing model size alone. Since training compute is well approximated by $C \approx 6ND$ (see the training cost estimate below), a fixed budget forces a trade-off between N and D; the Chinchilla analysis suggests growing both roughly with the square root of C, which works out to roughly 20 training tokens per parameter.
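To make the allocation concrete, the sketch below splits a fixed FLOP budget between parameters and tokens using the $C \approx 6ND$ approximation together with the roughly 20-tokens-per-parameter rule of thumb. The exact ratio varies with data quality and architecture, so treat the output as a rough estimate rather than a prescription.

```python
import math

# Rough compute-optimal split of a training budget, assuming:
#   C ~ 6 * N * D                 (training FLOPs estimate)
#   D ~ TOKENS_PER_PARAM * N      (Chinchilla-style rule of thumb)
TOKENS_PER_PARAM = 20  # approximate ratio from Hoffmann et al. (2022); varies in practice

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly exhaust compute_flops under the assumptions above."""
    # Substituting D = r*N into C = 6*N*D gives C = 6*r*N^2, hence N = sqrt(C / (6*r)).
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Example: a budget comparable to GPT-3's estimated training compute (~3.15e23 FLOPs).
n, d = chinchilla_optimal(3.15e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```

Under these assumptions, GPT-3's budget would be spent on a much smaller model (~50B parameters) trained on far more tokens (~1T), which mirrors the qualitative conclusion of the Chinchilla paper.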
Understanding scaling laws necessitates quantifying the computational costs involved.
Training Costs: Training LLMs is exceptionally computationally expensive. The primary cost is the vast number of matrix multiplications involved in the forward and backward passes. A rough estimate for the training FLOPs for a standard transformer is:
$$C_{\text{train}} \approx 6 \times N \times D$$

Where:

- N is the number of model parameters.
- D is the number of training tokens.

The factor of 6 reflects roughly 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass.
For a model like GPT-3 (175B parameters) trained on 300B tokens, this translates to approximately $6 \times (175 \times 10^9) \times (300 \times 10^9) \approx 3.15 \times 10^{23}$ FLOPs. Performing these calculations requires large clusters of high-end accelerators (GPUs or TPUs) running for weeks or months, incurring substantial hardware and energy costs.
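The estimate is simple to reproduce. The helper below is a minimal sketch of the $6ND$ approximation, using GPT-3's published parameter and token counts as example inputs.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
flops = training_flops(175e9, 300e9)
print(f"Estimated training compute: {flops:.2e} FLOPs")  # ~3.15e+23
```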
Inference Costs: While a single forward pass costs far less than a training run, inference becomes a significant cost when models are served at scale. For generating a single token, the FLOPs are approximately:
$$C_{\text{inference, per token}} \approx 2 \times N$$

Generating a sequence of length L requires roughly $2 \times N \times L$ FLOPs. However, inference is often bottlenecked by memory bandwidth, not just raw compute (FLOPs). This is because, for each generated token, the entire model's parameters (N parameters, often requiring 2N bytes in FP16/BF16) need to be read from memory (e.g., GPU HBM). The time taken is often dominated by this memory access rather than the computation itself, especially for large models where N is in the billions. We will examine these bottlenecks in more detail in the next section.
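The sketch below contrasts the two views of decoding a single token at batch size 1: the time implied by the FLOP count versus the time to stream every weight from accelerator memory. The hardware figures are illustrative placeholders (roughly in the range of a modern data-center accelerator), not the specification of any particular device.

```python
# Rough per-token latency estimate for autoregressive decoding at batch size 1,
# comparing the compute-bound and memory-bandwidth-bound views.
N_PARAMS = 70e9            # model size (example: a 70B-parameter model)
BYTES_PER_PARAM = 2        # FP16/BF16 weights
PEAK_FLOPS = 1.0e15        # assumed accelerator peak, ~1 PFLOP/s (illustrative)
MEM_BANDWIDTH = 2.0e12     # assumed HBM bandwidth, ~2 TB/s (illustrative)

compute_time = (2 * N_PARAMS) / PEAK_FLOPS                   # ~2N FLOPs per generated token
memory_time = (N_PARAMS * BYTES_PER_PARAM) / MEM_BANDWIDTH   # read every weight once per token

print(f"Compute-bound estimate: {compute_time * 1e3:.2f} ms/token")
print(f"Memory-bound estimate:  {memory_time * 1e3:.2f} ms/token")
# For large models at small batch sizes the memory estimate dominates by orders
# of magnitude, which is why token generation is typically memory-bandwidth bound.
```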
Memory Costs: Memory is frequently the binding constraint even before compute. The weights alone occupy about 2N bytes in FP16/BF16. Training additionally stores gradients, optimizer states, and activations, commonly estimated at around 16 bytes per parameter before activations are counted, while inference must hold both the weights and a key-value cache that grows with batch size and sequence length.
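A back-of-the-envelope helper like the following makes these numbers concrete. The byte counts are rough rules of thumb (FP16/BF16 weights and gradients, Adam-style optimizer state kept in FP32), not exact figures for any particular framework, and activations are deliberately excluded.

```python
def training_memory_gb(n_params: float) -> float:
    """Very rough training memory per model replica, ignoring activations.

    Assumes FP16/BF16 weights (2 bytes), FP16 gradients (2 bytes), and Adam-style
    optimizer state in FP32 (~12 bytes: master weights plus two moment estimates).
    """
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / 1e9

def inference_memory_gb(n_params: float) -> float:
    """Rough inference memory for the weights alone, in FP16/BF16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9

for n in [7e9, 70e9, 175e9]:
    print(f"{n / 1e9:>5.0f}B params: ~{inference_memory_gb(n):5.0f} GB to serve, "
          f"~{training_memory_gb(n):5.0f} GB to train (excl. activations)")
```

Even at the low end, these figures exceed the capacity of a single accelerator, which is why both training and serving large models rely on model parallelism or the compression techniques discussed later.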
These scaling laws and associated costs directly motivate the need for optimization. Reducing model size (N) through techniques like pruning or distillation directly lowers both compute and memory requirements. Quantization reduces the memory footprint of parameters and activations and can enable faster computation on specialized hardware. Efficient fine-tuning methods (PEFT) reduce the cost of adapting large pre-trained models. Hardware acceleration techniques aim to improve the FLOPs/second and memory bandwidth available for these demanding computations. Understanding these fundamental scaling relationships provides the quantitative basis for evaluating the effectiveness of the optimization techniques we will cover throughout this course.