Adapting massive pre-trained language models, often with billions of parameters stored in weight matrices like W, to specialized downstream tasks through full fine-tuning seems like the most direct approach. However, updating every single parameter in the model comes with substantial resource demands. Let's break down the specific costs involved.
Training large models is almost always bottlenecked by the available memory on Graphics Processing Units (GPUs). Full fine-tuning requires storing several large tensors in GPU memory simultaneously:
Model Weights (W): The parameters themselves constitute the base memory requirement. For a model with N parameters, storing them in standard 32-bit floating-point precision (FP32) requires 4N bytes. Even using mixed precision with 16-bit formats like FP16 or BF16 still requires 2N bytes. For a model like GPT-3 with 175 billion parameters, this translates to 700GB in FP32 or 350GB in FP16/BF16, far exceeding the capacity of commonly available GPUs.
Gradients (∇W): During backpropagation, gradients are computed for every parameter. These gradients typically have the same dimensions and require the same amount of memory as the model weights themselves (another 4N bytes for FP32 or 2N bytes for FP16/BF16 if mixed precision is used for gradients).
Optimizer States: Modern optimizers like Adam or AdamW maintain state information to adapt the learning rate for each parameter. Adam, for instance, stores two moments for each parameter: the first moment (momentum) and the second moment (variance). If these moments are stored in FP32, this adds another 8N bytes (4N for each moment). Even 8-bit optimizers, which quantize these moments, still add significant overhead. The total memory for parameters, gradients, and optimizer states using AdamW in FP32 can quickly reach 4N+4N+8N=16N bytes. Using mixed precision might reduce this to 2N+2N+8N=12N bytes (if moments are kept in FP32), or less if the moments are also quantized; the short estimator after this list puts these byte counts together.
Intermediate Activations: The forward pass generates activations for each layer. These activations are needed during the backward pass to compute gradients. The memory consumed by activations depends on the batch size, sequence length, model hidden dimension, and the number of layers. For large models and long sequences, activations can consume a substantial amount of memory, sometimes even exceeding the memory needed for weights and optimizer states. Techniques like activation checkpointing (or gradient checkpointing) can reduce this by recomputing activations during the backward pass instead of storing them, but this comes at the cost of increased computation time (typically around 30% more).
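To make these numbers concrete, here is a minimal Python sketch that turns the per-parameter byte counts above into per-model totals. The function name and defaults are illustrative (mixed-precision weights and gradients, Adam moments in FP32, decimal gigabytes); a real estimate would also need to account for activations, temporary buffers, and framework overhead.

```python
def full_finetune_memory_gb(n_params: float,
                            weight_bytes: float = 2,   # FP16/BF16 weights
                            grad_bytes: float = 2,     # FP16/BF16 gradients
                            optim_bytes: float = 8,    # Adam moments in FP32 (4 + 4)
                            activation_gb: float = 0.0) -> dict:
    """Rough memory estimate for full fine-tuning, in decimal GB.

    Counts weights, gradients, and optimizer states only; activation
    memory depends on batch size, sequence length, and architecture,
    so it must be supplied separately via activation_gb.
    """
    to_gb = 1e9
    weights = n_params * weight_bytes / to_gb
    grads = n_params * grad_bytes / to_gb
    optim = n_params * optim_bytes / to_gb
    return {
        "weights_gb": weights,
        "gradients_gb": grads,
        "optimizer_gb": optim,
        "activations_gb": activation_gb,
        "total_gb": weights + grads + optim + activation_gb,
    }

# The 12N-byte regime described above: FP16/BF16 weights and gradients,
# with Adam moments kept in FP32.
for n in (7e9, 13e9, 175e9):
    est = full_finetune_memory_gb(n)
    print(f"{n / 1e9:>5.0f}B params -> ~{est['total_gb']:,.0f} GB before activations")
```

Running this prints roughly 84 GB for a 7B model, 156 GB for 13B, and 2,100 GB for 175B, before any activation memory is added.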
The combination of these components means that fine-tuning even moderately large models (e.g., 7-13 billion parameters) requires high-end GPUs with large memory capacities (e.g., 40GB or 80GB), often necessitating distributed training setups across multiple GPUs, which adds communication overhead and complexity.
Table: Approximate memory requirements for full fine-tuning using mixed precision (FP16) for weights/gradients and standard Adam (FP32 states). Activation memory is illustrative and varies greatly with batch size and sequence length. Note that the total exceeds typical single-GPU capacity even for smaller models.
Beyond memory, the sheer number of computations required for full fine-tuning is enormous. Each training step involves a full forward pass through the network, a backward pass that computes gradients for every parameter, and an optimizer update applied to every parameter.
The dominant cost comes from the matrix multiplications within the Transformer layers (self-attention and feed-forward networks). The number of floating-point operations (FLOPs) scales with the number of parameters and with the number of tokens processed per step (batch size × sequence length). Training a large model for even a single epoch on a large dataset can take days or weeks, even on powerful multi-GPU clusters, incurring substantial energy and hardware costs.
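A common rule of thumb estimates training compute at about 6 FLOPs per parameter per token (roughly 2 for the forward pass and 4 for the backward pass). The sketch below applies that rule to get a ballpark wall-clock figure; the per-GPU throughput and utilization values are assumptions for illustration, not measurements.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token
    (roughly 2 for the forward pass, 4 for the backward pass)."""
    return 6.0 * n_params * n_tokens

def training_days(n_params: float, n_tokens: float, gpus: int,
                  peak_flops_per_gpu: float = 312e12,  # A100 dense BF16 peak
                  utilization: float = 0.4) -> float:   # assumed sustained utilization
    """Very rough wall-clock estimate given an assumed sustained throughput."""
    sustained = gpus * peak_flops_per_gpu * utilization
    return training_flops(n_params, n_tokens) / sustained / 86_400  # seconds per day

# Hypothetical example: one pass over 10 billion tokens with a
# 13B-parameter model on 8 GPUs.
print(f"~{training_days(13e9, 10e9, gpus=8):.0f} days")
```

Under these assumptions, a single pass over 10 billion tokens with a 13B-parameter model already takes on the order of nine days on 8 GPUs.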
Fine-tuning workflows typically involve saving model checkpoints periodically. A single checkpoint for a large model, storing the full set of weights W (and potentially optimizer states), can consume hundreds of gigabytes or even terabytes of disk space.
Furthermore, consider scenarios where you need to adapt a base model to multiple distinct tasks or datasets. Full fine-tuning requires storing a complete, independent copy of the massive model weights for each task. This quickly becomes unmanageable from a storage perspective and hinders efficient deployment where multiple specialized models might be needed. If a 70B parameter model checkpoint is ~140GB (in BF16), fine-tuning it for 10 different tasks would require 1.4TB of storage just for the final weights, ignoring intermediate checkpoints.
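The same back-of-the-envelope arithmetic can be written down directly. The helper below (hypothetical names, decimal gigabytes) reproduces the storage figures above and shows how quickly optimizer states and per-task copies add up.

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 2.0,
                       include_optimizer: bool = False) -> float:
    """Disk footprint of one full checkpoint, in decimal GB.

    include_optimizer=True adds Adam's two FP32 moments (8 extra bytes
    per parameter), as saved when a run must be exactly resumable.
    """
    per_param = bytes_per_param + (8.0 if include_optimizer else 0.0)
    return n_params * per_param / 1e9

n_params, n_tasks = 70e9, 10
weights_only = checkpoint_size_gb(n_params)                        # ~140 GB in BF16
resumable = checkpoint_size_gb(n_params, include_optimizer=True)   # ~700 GB with Adam states
print(f"{n_tasks} task-specific copies: {n_tasks * weights_only:,.0f} GB of final weights alone")
```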
These memory, compute, and storage challenges associated with updating every parameter W clearly motivate the search for more efficient adaptation techniques. Parameter-Efficient Fine-Tuning (PEFT) methods aim to significantly reduce these costs while retaining high performance on downstream tasks, forming the core subject of this course.