Applying mathematical weight updates requires physical hardware memory. Specifically, these operations rely on the Video RAM (VRAM) of a Graphics Processing Unit (GPU). Understanding exactly how memory is consumed during training prevents out-of-memory errors and determines which models you can realistically train on your available hardware.
Memory consumption during language model training is divided into four main components. These are the model weights, the optimizer states, the gradients, and the activations. During inference, you only need enough memory to hold the model weights and a small amount of context data. Training, however, requires maintaining all four components simultaneously.
To estimate the memory required for the model weights, you multiply the number of parameters by the size of the data type used to store them. Most small language models are trained using 16-bit precision, in formats such as FP16 or BF16. In a 16-bit format, each parameter occupies 2 bytes of memory. If N represents the total number of parameters, the memory required for the weights is calculated as:

Memory_weights = N × 2 bytes
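As a quick sanity check, this calculation can be sketched in Python. The helper name and the 1 GB = 10⁹ bytes convention are illustrative choices, not part of any particular framework:

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to store model weights, in gigabytes (1 GB = 1e9 bytes).

    bytes_per_param defaults to 2 for 16-bit formats (FP16/BF16).
    """
    return num_params * bytes_per_param / 1e9

# A 1.5-billion-parameter model stored in 16-bit precision:
print(weight_memory_gb(1.5e9))  # 3.0
```

Switching `bytes_per_param` to 4 gives the FP32 figure, which is why 16-bit storage halves the weight footprint.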
During the backward pass of gradient descent, the network calculates gradients for every weight. Because these gradients are also stored in 16-bit precision, their memory footprint mirrors that of the model weights.
The optimizer states are the largest consumer of VRAM during full fine-tuning. The standard optimizer for training language models is AdamW. AdamW maintains two moving averages for each parameter. These are the momentum and the variance. To ensure numerical stability, these states are stored in 32-bit precision (FP32), where each value takes 4 bytes. Since there are two states per parameter, the optimizer requires 8 bytes per parameter.
We can combine these static memory requirements into a single formula for full fine-tuning. Summing the weights (2 bytes per parameter), the gradients (2 bytes per parameter), and the optimizer states (8 bytes per parameter), the total fixed memory required before passing any data through the model is:

Memory_fixed = 2N + 2N + 8N = 12N bytes
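The same accounting can be expressed as a small sketch, using the per-parameter byte counts derived above (the constant names are illustrative):

```python
BYTES_WEIGHTS = 2    # 16-bit model weights
BYTES_GRADS = 2      # 16-bit gradients
BYTES_OPTIMIZER = 8  # two FP32 AdamW states (momentum + variance) at 4 bytes each

def fixed_training_memory_gb(num_params):
    """Static VRAM for full fine-tuning, before activations, in GB (1e9 bytes)."""
    bytes_per_param = BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER  # 12 bytes
    return num_params * bytes_per_param / 1e9

print(fixed_training_memory_gb(1.5e9))  # 18.0
print(fixed_training_memory_gb(7e9))    # 84.0
```

The 12-bytes-per-parameter figure is worth memorizing: it lets you estimate the fixed cost of full fine-tuning for any model size in your head.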
(Figure: Proportion of VRAM allocated to different training components for a 1-billion-parameter model using full fine-tuning and the AdamW optimizer.)
For example, applying the formula above to a small language model with 1.5 billion parameters gives 1.5 × 10⁹ parameters × 12 bytes per parameter = 18 gigabytes (GB) of fixed memory alone.
This 18 GB calculation does not include the fourth component of memory consumption, which is the activations. Activations refer to the intermediate tensor representations created during the forward pass. These tensors are held in memory because they are required to compute the gradients during the backward pass.
The memory consumed by activations is highly variable. It scales based on your batch size and your maximum sequence length. In standard transformer architectures, the memory required for the self-attention mechanism grows quadratically with the sequence length. If you double the length of the text input, the attention memory requirement increases by a factor of four. For a 1.5 billion parameter model with a modest batch size, activations can easily consume an additional 4 to 6 GB of VRAM.
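The quadratic scaling can be illustrated with a rough estimate of the attention score matrices held for the backward pass. This is a simplified sketch, not a full activation accounting: the function, its parameters, and the example configuration are hypothetical, and real frameworks store additional intermediate tensors:

```python
def attention_memory_gb(batch, heads, layers, seq_len, bytes_per_value=2):
    """Rough memory for stored attention score matrices, in GB (1e9 bytes).

    Each layer keeps one (seq_len x seq_len) score matrix per head per
    batch element, so the total grows with seq_len squared.
    """
    return batch * heads * layers * seq_len ** 2 * bytes_per_value / 1e9

# Doubling the sequence length quadruples the attention memory:
short = attention_memory_gb(batch=4, heads=16, layers=24, seq_len=1024)
long = attention_memory_gb(batch=4, heads=16, layers=24, seq_len=2048)
print(long / short)  # 4.0
```

This is why long-context training is so much more expensive than the parameter count alone suggests, and why techniques such as FlashAttention, which avoids materializing the full score matrix, are widely used.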
Adding the fixed states and the activations together brings the total VRAM requirement for a 1.5 billion parameter model to over 22 GB. A standard high-end consumer GPU typically offers 24 GB of VRAM. This means training a relatively small 1.5B model pushes the hardware to its absolute limit. If you attempted to apply full fine-tuning to a 7 billion parameter model, the fixed states alone would require over 84 GB of VRAM. This makes it impossible to train on a single consumer GPU.
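Putting the two estimates together gives a quick feasibility check against a given VRAM budget. The helper below is an illustrative sketch using the 12-bytes-per-parameter fixed cost derived earlier and an activation estimate you supply:

```python
def fits_on_gpu(num_params, activation_gb, vram_gb=24.0):
    """Check whether full fine-tuning fits in the given VRAM budget (GB).

    Fixed cost: 12 bytes/parameter (weights + gradients + AdamW states).
    """
    fixed_gb = num_params * 12 / 1e9
    return fixed_gb + activation_gb <= vram_gb

# 1.5B model: 18 GB fixed + ~5 GB activations = 23 GB, barely within 24 GB.
print(fits_on_gpu(1.5e9, activation_gb=5.0))  # True
# 7B model: 84 GB fixed alone already exceeds a 24 GB consumer card.
print(fits_on_gpu(7e9, activation_gb=5.0))    # False
```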
Understanding these hardware limits emphasizes why full fine-tuning is often impractical for language models. It also establishes the technical motivation for using Parameter-Efficient Fine-Tuning methods. By selectively updating only a small fraction of the parameters, we drastically reduce the memory needed for gradients and optimizer states.
Before we introduce those efficient training techniques, you need a functional baseline. The next section will guide you through writing the Python code required to load a pre-trained model into memory within these hardware constraints.