While Adapter Tuning, Prefix Tuning, and Prompt Tuning all offer pathways to efficient fine-tuning, their resource demands differ significantly. Understanding these differences in memory usage and computational requirements during both training and inference is essential for selecting the right PEFT method for your specific hardware constraints and application needs. Let's compare these methods, including LoRA (covered in Chapter 2), against the baseline of full fine-tuning.
Training Phase Footprints
During training, the primary resource consumers are:
- Model Weights: The parameters of the base LLM and any newly introduced parameters (like adapters or LoRA matrices).
- Activations: Intermediate results saved during the forward pass, needed for gradient calculation in the backward pass. Their size depends on the batch size, sequence length, and model architecture.
- Gradients: Calculated for each trainable parameter during the backward pass.
- Optimizer States: Information maintained by the optimizer for each trainable parameter (e.g., the momentum and variance estimates in Adam/AdamW). These often consume significant memory, typically 2x or more the size of the trainable parameters themselves, depending on the optimizer and precision. A rough sizing sketch follows this list.
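To make these components concrete, here is a minimal back-of-the-envelope sketch. The model size, precisions, and optimizer assumptions (a 7B-parameter model in bf16, fp32 gradients, AdamW with two fp32 states per trainable parameter) are illustrative choices, and activation memory and fp32 master weight copies are deliberately left out.

```python
# Rough training-memory estimate covering weights, gradients, and optimizer states.
# Illustrative assumptions: 7B-parameter model, bf16 weights (2 bytes), fp32
# gradients (4 bytes), AdamW keeping two fp32 states per trainable parameter.
# Activations and fp32 master weights are excluded for simplicity.

GiB = 1024 ** 3

def training_memory_gib(total_params, trainable_params,
                        weight_bytes=2, grad_bytes=4,
                        optim_states=2, optim_state_bytes=4):
    weights = total_params * weight_bytes                        # all weights stay resident
    grads = trainable_params * grad_bytes                        # gradients only for trainables
    optim = trainable_params * optim_states * optim_state_bytes  # e.g. AdamW momentum + variance
    return (weights + grads + optim) / GiB

base = 7_000_000_000
print(f"Full fine-tuning:      {training_memory_gib(base, base):6.1f} GiB")
print(f"LoRA (~10M trainable): {training_memory_gib(base, 10_000_000):6.1f} GiB")
print(f"Prompt tuning (~100K): {training_memory_gib(base, 100_000):6.1f} GiB")
```

Even this crude estimate shows why freezing the base model matters: almost all of the gap between full fine-tuning and the PEFT methods comes from gradients and optimizer states, not from the weights themselves.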
Let's see how the different PEFT methods stack up (a parameter-count sketch follows this list):
- Full Fine-Tuning: This serves as our high-water mark. Gradients and optimizer states are required for all model parameters, leading to substantial memory requirements, often hundreds of gigabytes for large models. Computation involves a full forward and backward pass across all layers.
- LoRA: Drastically reduces memory needs by freezing the base model. Only the low-rank matrices A and B require gradients and optimizer states. The number of trainable parameters is typically small (e.g., 2×d×r per adapted linear layer, where d is the hidden dimension and r is the rank). Activation memory remains similar to full fine-tuning for the base model's forward pass. Computational overhead comes from the additional matrix multiplications (B×A) during the forward pass and the corresponding gradient calculations. The memory savings primarily stem from the reduced number of parameters needing gradients and optimizer states.
- Adapter Tuning: Similar to LoRA, the base model is frozen. Memory is consumed by the adapter parameters, their gradients, and optimizer states. The adapter bottleneck dimension dictates the number of trainable parameters. If the bottleneck dimension is small, memory usage is low, often comparable to or slightly more than LoRA depending on the configuration and number of inserted adapters. Computation involves passing activations through the small adapter modules, adding a modest overhead to the forward and backward passes.
- Prefix Tuning: Introduces trainable prefix vectors. The base model remains frozen. Memory is needed for the prefix parameters, their gradients, and optimizer states. The number of parameters depends on the prefix length and the model's hidden dimension. This typically results in a very low memory footprint, often lower than Adapters or LoRA. Computational overhead is minimal, mainly related to processing the prefix during attention calculations.
- Prompt Tuning / P-Tuning: These methods have the smallest memory footprint among PEFT techniques. Only the soft prompt embeddings are trained. The number of trainable parameters is extremely small, leading to negligible memory overhead for gradients and optimizer states compared to the base model size. Computational overhead during training is also minimal.
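The parameter counts above follow directly from the model's dimensions. The sketch below uses hypothetical configuration values (hidden size, layer count, rank, bottleneck width, prefix and prompt lengths are all assumptions), applies the simple per-layer formulas described in the bullets, and ignores biases, layer norms, and method-specific extras such as Prefix Tuning's reparameterization network.

```python
# Trainable-parameter estimates for each PEFT method; all configuration
# values below are hypothetical and chosen only for illustration.

d        = 4096   # hidden dimension
layers   = 32     # transformer layers
adapted  = 4      # adapted weight matrices per layer (e.g. attention projections)

r        = 8      # LoRA rank
m        = 64     # adapter bottleneck dimension
p_prefix = 30     # prefix length (virtual tokens per layer)
p_prompt = 30     # soft prompt length (virtual tokens at the input only)

lora    = layers * adapted * 2 * d * r   # A (d x r) plus B (r x d) per adapted matrix
adapter = layers * 2 * (2 * d * m)       # down- and up-projection, two adapters per layer
prefix  = layers * p_prefix * 2 * d      # key and value prefix vectors in every layer
prompt  = p_prompt * d                   # soft prompt embeddings only

for name, n in [("LoRA", lora), ("Adapter Tuning", adapter),
                ("Prefix Tuning", prefix), ("Prompt Tuning", prompt)]:
    print(f"{name:15s} {n:>12,d} trainable parameters")
```

Against a base model with billions of parameters, every one of these counts is a rounding error, which is exactly why the gradient and optimizer-state overheads become negligible.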
Inference Phase Footprints
During inference, gradients and optimizer states are discarded. The focus shifts to:
- Model Weights Storage: How much disk space is needed to store the fine-tuned model components.
- Inference Latency: The time taken to process a single input sequence, influenced by any additional computations introduced by the PEFT method.
Here's the inference comparison:
- Full Fine-Tuning: Requires storing the entire modified model, which is as large as the original base model. Inference latency is the baseline speed of the fine-tuned model architecture.
- LoRA: Offers flexibility. The LoRA matrices (A and B) can be merged with the original base model weights (W0) offline to produce a new weight matrix W = W0 + BA. In this merged state, storage is identical to full fine-tuning, and inference latency matches the base model (no overhead). Alternatively, the base model and LoRA weights can be kept separate: this requires storing only the original base model plus the small A and B matrices (very low storage overhead), but inference must compute the LoRA update (BAx) on the fly, adding a small latency overhead proportional to the rank r and the number of adapted layers. A merge sketch follows this list.
- Adapter Tuning: Requires storing the original base model plus the weights of the adapter modules. Since adapters are small, the storage overhead is minimal. Inference requires passing activations through the adapter layers, introducing a small, fixed latency overhead compared to the base model.
- Prefix Tuning: Requires storing the base model plus the learned prefix vectors. Storage overhead is very low. Inference involves prepending the prefix, which typically adds negligible latency.
- Prompt Tuning / P-Tuning: Requires storing the base model plus the learned prompt embeddings. Storage overhead is extremely low. Inference latency is virtually identical to the base model, as it only involves processing a slightly modified input representation.
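To make the merge concrete, here is a minimal PyTorch sketch that folds a single LoRA update into a frozen weight matrix. The dimensions, the random tensors, and the alpha/r scaling factor (a common LoRA convention not discussed above) are illustrative assumptions.

```python
import torch

d, r = 4096, 8
alpha = 16.0                     # LoRA scaling numerator (common convention, assumed here)

W0 = torch.randn(d, d)           # frozen base weight
A  = torch.randn(r, d) * 0.01    # trained LoRA matrix A (r x d)
B  = torch.randn(d, r) * 0.01    # trained LoRA matrix B (d x r)
x  = torch.randn(d)              # an input activation

# Separate deployment: compute the low-rank update on the fly for every input.
y_separate = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged deployment: fold the update into the weight once, offline.
W_merged = W0 + (alpha / r) * (B @ A)
y_merged = W_merged @ x

# The outputs agree up to float rounding, so the merged model adds no latency.
print((y_separate - y_merged).abs().max())
```

Keeping the weights separate trades that zero-overhead inference for the ability to swap adapters cheaply, since merging bakes one task into the full-size weight matrix.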
Comparative Summary
Choosing among PEFT methods involves trading off parameter efficiency, memory usage, computational cost, and potential impact on model performance (discussed later). Prompt Tuning is the lightest, while LoRA and Adapters occupy a middle ground with potentially better representational power because they modify the model's internal computations more directly.
| Method | Trainable Parameters | Training Memory | Storage at Inference | Inference Latency Overhead |
|---|---|---|---|---|
| Full Fine-Tuning | All | Very High | Full model copy | None (baseline) |
| LoRA (Merged) | Very Few | Low | Full model copy | None |
| LoRA (Separate) | Very Few | Low | Base model + small A/B matrices | Small |
| Adapter Tuning | Few | Low | Base model + adapter modules | Small |
| Prefix Tuning | Very Few | Very Low | Base model + prefix vectors | Negligible |
| Prompt Tuning / P-Tuning | Fewest | Very Low | Base model + prompt embeddings | Negligible |

Relative comparison of resource requirements for different fine-tuning strategies. Note that 'LoRA (Merged)' refers to storing the combined weights, while 'LoRA (Separate)' refers to storing the base model and LoRA matrices independently. Costs are illustrative relative ranks.
Understanding these footprints helps in planning experiments and deployments. If your primary constraint is training memory, Prompt Tuning or Prefix Tuning might be attractive. If inference latency is critical and you can afford the one-time merge cost, merged LoRA runs exactly as fast as the base model. If you need flexible, low-storage deployment of multiple task-specific adaptations of a single base model, separate LoRA weights, Adapters, Prefixes, or Prompts are excellent choices (a deployment sketch follows below). Techniques like QLoRA (discussed in the next chapter) further reduce the memory footprint, especially during training, by quantizing the base model itself.
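To illustrate that multi-task deployment pattern, here is a sketch that assumes the Hugging Face transformers and peft libraries; the model name and adapter paths are placeholders, not real artifacts.

```python
# Sketch: serving several task-specific LoRA adapters on one shared base model.
# Assumes the Hugging Face `transformers` and `peft` libraries; the model name
# and adapter directories below are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")

# Attach the first adapter, then load more; each adds only megabytes of weights.
model = PeftModel.from_pretrained(base, "adapters/summarization",
                                  adapter_name="summarization")
model.load_adapter("adapters/translation", adapter_name="translation")

# Switch adaptations per request without reloading the multi-gigabyte base model.
model.set_adapter("summarization")
# ... run summarization requests ...
model.set_adapter("translation")
# ... run translation requests ...
```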