Training a neural network using the standard supervised approach requires calculating gradients and updating every single parameter within the model architecture. While this method works well for smaller networks, applying it to modern language models introduces severe hardware limitations. Even Small Language Models containing between one and seven billion parameters require significant computational resources that quickly exceed the capacity of consumer hardware.
To understand why full fine-tuning is so expensive, you must look at the mathematical realities of GPU memory allocation during training. Video RAM is not just used to store the model weights. The memory footprint is divided into four main components: model parameters, gradients, optimizer states, and forward activations.
When you load a model in 16-bit precision, each parameter requires two bytes of memory. For a 7 billion parameter model, the weights alone consume approximately 14 gigabytes of VRAM. During the backward pass, the network calculates gradients for every parameter to determine the direction and magnitude of the weight updates. Storing these gradients in 16-bit precision requires another 14 gigabytes.
The largest memory bottleneck comes from the optimizer. The standard optimizer used in training language models is AdamW. It maintains two distinct states for every parameter: a moving average of the gradient and a moving average of the squared gradient. To maintain numerical stability, these states are typically stored in 32-bit floating-point precision, requiring four bytes each. This means the optimizer states demand eight bytes per parameter.
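The per-parameter byte counts described above can be tallied with a short script. This is an illustrative sketch rather than part of any training framework; the function name and the decimal-gigabyte convention (1 GB = 10⁹ bytes) are choices made for the example.

```python
def full_finetune_memory_gb(num_params):
    """Estimate training memory (decimal GB) for full fine-tuning with
    16-bit weights and gradients plus 32-bit AdamW optimizer states."""
    weights   = 2 * num_params  # fp16/bf16 parameters: 2 bytes each
    gradients = 2 * num_params  # fp16/bf16 gradients: 2 bytes each
    optimizer = 8 * num_params  # AdamW: two fp32 states, 4 bytes each
    total_bytes = weights + gradients + optimizer
    return total_bytes / 1e9

print(full_finetune_memory_gb(7e9))  # 7B parameters -> 84.0 GB
```

Note that this estimate is a floor: it omits forward activations, framework overhead, and CUDA memory fragmentation, all of which push the real requirement higher.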
Figure: memory allocation breakdown per parameter during standard full fine-tuning with the AdamW optimizer.
You can express the baseline memory requirement in bytes for full fine-tuning as a simple equation, where $N$ represents the total number of model parameters:

$$M = \underbrace{2N}_{\text{weights}} + \underbrace{2N}_{\text{gradients}} + \underbrace{8N}_{\text{optimizer states}} = 12N \text{ bytes}$$
For a 7 billion parameter model, substituting $N = 7 \times 10^9$ gives $12 \times 7 \times 10^9 = 84 \times 10^9$ bytes, a minimum memory requirement of 84 gigabytes just to hold the weights, gradients, and optimizer states. This calculation does not even include the forward activations, which scale with the sequence length of your training data and the batch size. Consequently, training a model of this size requires specialized hardware setups, such as multiple A100 GPUs connected via high-speed communication links.
In addition to the immediate hardware limitations, full fine-tuning introduces significant storage and deployment problems. When you update every weight in the network, the resulting model is completely distinct from the base model. If you are developing multiple specialized applications, perhaps one model for summarizing medical documents and another for generating Python code, full fine-tuning requires you to save and host entirely separate copies of the model. For a 7 billion parameter model, this means allocating 14 gigabytes of disk space and deployment memory for every single task.
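The linear growth in storage can be made concrete with a one-line helper. The function name and the five-task figure below are hypothetical, chosen only to illustrate the scaling; the 2-bytes-per-parameter assumption matches 16-bit storage.

```python
def deployment_storage_gb(num_tasks, num_params=7e9, bytes_per_param=2):
    """Disk/deployment footprint (decimal GB) when each task
    gets its own full copy of a model stored in 16-bit precision."""
    return num_tasks * num_params * bytes_per_param / 1e9

print(deployment_storage_gb(5))  # five specialized tasks -> 70.0 GB
```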
This linear scaling of model copies creates an inefficient deployment architecture. Loading multiple large files into memory simultaneously is impractical for local environments or small-scale cloud servers. Furthermore, fine-tuning all parameters on small, specialized datasets often leads to catastrophic forgetting. The model might adapt perfectly to the new task but lose the general language reasoning capabilities it acquired during its initial pre-training phase.
These combined hardware, memory, and storage limitations make full fine-tuning inaccessible and inefficient for most local development scenarios. By understanding exactly how memory is distributed, particularly into optimizer states and gradients, you can see why techniques that freeze the base model parameters are necessary. Leaving the majority of weights untouched eliminates their corresponding gradients and optimizer states, drastically reducing the memory footprint required for training.
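The memory savings from freezing can be estimated by splitting the earlier formula: all weights must still be loaded, but gradients and optimizer states exist only for the trainable subset. The sketch below is a back-of-the-envelope model under those assumptions, and the 70-million-parameter figure (roughly 1% of 7B, in the spirit of adapter-style methods) is hypothetical.

```python
def training_memory_gb(total_params, trainable_params):
    """Estimate training memory (decimal GB) when only a subset of
    parameters is trainable: the full model stays loaded in fp16, but
    gradients and AdamW states are allocated only for trainable weights."""
    weights   = 2 * total_params      # all weights loaded, 2 bytes each
    gradients = 2 * trainable_params  # fp16 gradients, trainable only
    optimizer = 8 * trainable_params  # fp32 AdamW states, trainable only
    return (weights + gradients + optimizer) / 1e9

print(training_memory_gb(7e9, 7e9))   # full fine-tuning -> 84.0 GB
print(training_memory_gb(7e9, 70e6))  # ~1% trainable    -> 14.7 GB
```

Freezing 99% of the weights collapses the training footprint from 84 GB to roughly the cost of simply holding the model in memory, which is exactly why parameter-efficient techniques make local fine-tuning feasible.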