As we start scaling Transformer models, a natural question arises: how do we best invest our computational resources? Should we prioritize adding more layers, widening the hidden dimensions, or training on significantly more data? Simply increasing parameters without a clear strategy can lead to diminishing returns or inefficient use of expensive compute cycles. Fortunately, empirical studies have revealed predictable relationships between model performance, model size, dataset size, and the amount of computation used for training. These relationships are often referred to as "scaling laws".
Understanding these scaling laws provides a valuable framework for making informed decisions about model architecture and training regimes. They allow us to estimate the expected performance gains from scaling up different aspects of the training process and help optimize the allocation of our computational budget.
Pioneering work, notably by Kaplan et al. (2020) from OpenAI, demonstrated that the performance of language models, typically measured by the cross-entropy loss on a held-out test set, improves predictably as we scale up model size (number of non-embedding parameters, N), dataset size (number of training tokens, D), and compute (total floating-point operations, FLOPs, C).
The core finding is that the test loss L often follows a power-law relationship with N, D, and C, when other factors are not the bottleneck. For model size N and dataset size D, the relationship can often be modeled as:
$$L(N, D) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Here, $L_\infty$ is the irreducible loss that remains even with unlimited model size and data, while $N_c$, $D_c$, $\alpha_N$, and $\alpha_D$ are empirically determined constants: $N_c$ and $D_c$ set the scale of each term, and the exponents $\alpha_N$ and $\alpha_D$ control how quickly the loss falls as N and D grow.
This formula suggests that the loss is dominated by the term corresponding to the limiting factor: if the model is too small ($N \ll N_c$), increasing N helps significantly; if the dataset is too small ($D \ll D_c$), increasing D is more beneficial.
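To make this concrete, here is a minimal sketch that evaluates the formula for a few combinations of N and D. The constants ($L_\infty$, $N_c$, $D_c$, $\alpha_N$, $\alpha_D$) are illustrative placeholders, not values fit to any real model:

# Illustrative placeholder constants; real values must be fit empirically
L_INF = 1.7                 # irreducible loss
N_C, ALPHA_N = 1e13, 0.08   # model-size term
D_C, ALPHA_D = 1e13, 0.10   # data-size term

def predicted_loss(N, D):
    """Loss predicted by the assumed L(N, D) power law."""
    return L_INF + (N_C / N) ** ALPHA_N + (D_C / D) ** ALPHA_D

# Small model, ample data: the model-size term dominates
print(predicted_loss(N=1e8, D=1e12))
# Larger model, same data: the model-size term shrinks substantially
print(predicted_loss(N=1e10, D=1e12))
# Large model, scarce data: now the data term is the bottleneck
print(predicted_loss(N=1e10, D=1e9))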
Similarly, the loss often scales as a power law with the compute budget C:
$$L(C) \approx L_\infty + \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $C_c$ and $\alpha_C$ are again empirically determined constants. These relationships typically hold over several orders of magnitude, making them useful for extrapolation.
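Because a power law is a straight line on log-log axes, the exponent can be fit from a handful of smaller training runs and then extrapolated. The sketch below does this with NumPy; the (compute, loss) pairs and the assumed irreducible loss are fabricated purely for illustration:

import numpy as np

# Hypothetical (compute, loss) measurements from small-scale runs
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs
loss = np.array([3.40, 3.05, 2.78, 2.57])      # held-out cross-entropy
L_INF = 2.0                                    # assumed irreducible loss

# L - L_inf = (C_c / C)^alpha_C  =>  log(L - L_inf) = alpha_C*log(C_c) - alpha_C*log(C)
slope, intercept = np.polyfit(np.log(compute), np.log(loss - L_INF), 1)
alpha_C = -slope
C_c = np.exp(intercept / alpha_C)

# Extrapolate to a much larger budget than any of the fitted runs
C_target = 1e23
predicted = L_INF + (C_c / C_target) ** alpha_C
print(f"alpha_C ≈ {alpha_C:.3f}, predicted loss at 1e23 FLOPs ≈ {predicted:.2f}")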
A log-log plot illustrating how test loss might decrease as model size increases, often following a predictable power law over several orders of magnitude.
A significant refinement to understanding scaling laws came from Hoffmann et al. (2022) at DeepMind, known as the "Chinchilla" study. They performed a careful analysis to determine the optimal allocation of a fixed compute budget C between model size N and data size D.
The approximate relationship between compute, model size, and data size for training dense transformer models is often estimated as:
$$C \approx 6 \times N \times D$$

This reflects that the computational cost is roughly proportional to the number of parameters multiplied by the number of tokens processed. Given a fixed compute budget C, the Chinchilla study suggested that for optimal performance (lowest loss), both N and D should be scaled roughly in proportion to the square root of the compute budget. That is, if you double the compute budget, you should aim to increase both model size and dataset size by a factor of about $\sqrt{2} \approx 1.4$.
This finding was important because it suggested that many large language models trained prior to this study (like GPT-3 or Gopher) might have been "over-parameterized" relative to their training data. For the compute budgets used, better performance might have been achieved with smaller models trained on significantly more data. The Chinchilla model itself, trained according to these "compute-optimal" principles, achieved state-of-the-art results with fewer parameters but more training data than some of its larger contemporaries.
Let's illustrate with a simplified calculation. Suppose we fit a scaling law and find optimal performance occurs when $N \approx k\sqrt{C}$ and $D \approx \frac{C}{6N} \approx \frac{C}{6k\sqrt{C}} = \frac{\sqrt{C}}{6k}$ for some constant $k$. If we have a compute budget $C_1$ and find the optimal $N_1, D_1$, then for a larger budget $C_2 = 4C_1$, the optimal parameters would be approximately $N_2 \approx k\sqrt{4C_1} = 2N_1$ and $D_2 \approx \frac{\sqrt{4C_1}}{6k} = 2D_1$. We scale both model and data proportionally to $\sqrt{C}$.
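The same arithmetic is easy to script. The helper below assumes the $N \approx k\sqrt{C}$ relationship from the paragraph above, with a made-up calibration constant $k$, and confirms that quadrupling the budget roughly doubles both the optimal model size and the optimal token count:

import math

def compute_optimal_split(C, k=0.1):
    """Split a FLOP budget C into model size N and tokens D, assuming N = k * sqrt(C)."""
    N = k * math.sqrt(C)
    D = C / (6 * N)   # from C ≈ 6 * N * D
    return N, D

C1 = 1e21
N1, D1 = compute_optimal_split(C1)
N2, D2 = compute_optimal_split(4 * C1)
print(f"N scales by {N2 / N1:.1f}x, D scales by {D2 / D1:.1f}x")
# Both factors come out to about 2.0, as the calculation above predicts.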
These scaling laws have several direct consequences for engineers building large models. One of the most immediate is budgeting: we can use the $C \approx 6ND$ formula to estimate the computational cost of training before committing hardware. For instance, training a 7 billion parameter model ($N = 7 \times 10^9$) on 1 trillion tokens ($D = 1 \times 10^{12}$) would require approximately:
# Approximate number of non-embedding parameters (simplified)
N = 7e9
# Number of training tokens
D = 1e12

# Estimated training compute using the C ≈ 6 * N * D rule
C_flops = 6 * N * D

# Convert total FLOPs to petaflop-days:
# 1 petaflop-day = 1e15 FLOP/s sustained for 86,400 seconds
petaflops = 1e15
seconds_per_day = 86400
C_petaflop_days = C_flops / (petaflops * seconds_per_day)
print(f"Estimated Compute: {C_flops:.2e} FLOPs")
print(f"Estimated Compute: {C_petaflop_days:.2f} Petaflop-days")
# Output:
# Estimated Compute: 4.20e+22 FLOPs
# Estimated Compute: 486.11 Petaflop-days
This calculation highlights the immense computational scale involved. A cluster sustaining 10 PFLOP/s would take roughly 49 days for such a run, ignoring overheads and potential inefficiencies. This underscores why efficient resource allocation guided by scaling laws is so important.
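As a back-of-the-envelope follow-up, the snippet below converts the compute estimate into wall-clock time for a hypothetical cluster. Both the peak throughput and the 40% realized utilization are assumptions chosen for illustration, not measurements of any particular hardware:

# Total training compute from the 6ND estimate above
C_flops = 4.2e22

# Hypothetical cluster: 10 PFLOP/s peak throughput
peak_flops_per_sec = 10e15
seconds_per_day = 86400

# Idealized case: the roughly 49-day figure quoted above
ideal_days = C_flops / (peak_flops_per_sec * seconds_per_day)

# With an assumed 40% realized utilization (communication, stragglers, restarts)
utilization = 0.4
realistic_days = C_flops / (peak_flops_per_sec * utilization * seconds_per_day)

print(f"Ideal: {ideal_days:.0f} days, at 40% utilization: {realistic_days:.0f} days")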
While powerful, it's necessary to remember the context of these scaling laws: they were fit to particular architectures, datasets, and training setups, they predict pre-training loss rather than downstream task performance, and extrapolating them far beyond the regimes in which they were measured carries real uncertainty.
In summary, scaling laws provide an invaluable quantitative lens through which to view the process of building larger and more capable language models. They transform scaling from a guessing game into a more predictable engineering discipline, enabling more efficient use of computational resources and providing a framework for assessing progress in the field. As we explore specific architectural choices in the following sections, keep these scaling principles in mind as they often motivate the design decisions made for large-scale Transformers.