Updating all weights in a neural network during training demands significant computational resources. Consider a typical consumer setup running standard supervised fine-tuning: optimizer states, gradients, and model parameters quickly consume the available GPU memory. Even for small language models, full fine-tuning often exceeds the VRAM of a single consumer-grade graphics card.
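A back-of-the-envelope sketch makes the problem concrete. The 16-bytes-per-parameter figure below is a common rule of thumb for AdamW with mixed precision (fp16 weights and gradients, plus fp32 master weights and two optimizer moments); it ignores activations, so real usage is higher.

```python
# Assumed per-parameter costs for mixed-precision AdamW (rule of thumb):
#   fp16 weights: 2 bytes, fp16 gradients: 2 bytes,
#   fp32 master weights + two Adam moments: 4 + 4 + 4 = 12 bytes.
BYTES_PER_PARAM = 2 + 2 + 12  # 16 bytes per parameter

def full_finetune_gib(num_params: int) -> float:
    """Approximate training memory in GiB, excluding activations."""
    return num_params * BYTES_PER_PARAM / 2**30

# Even a 1-billion-parameter model needs roughly 15 GiB before
# activations -- more than many consumer GPUs provide.
print(f"{full_finetune_gib(1_000_000_000):.1f} GiB")  # → 14.9 GiB
```

Doubling the parameter count doubles this footprint, which is why full fine-tuning of multi-billion-parameter models is usually a multi-GPU job.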
Parameter-Efficient Fine-Tuning (PEFT) addresses these hardware limitations. Instead of modifying every parameter, PEFT freezes the base model and injects a small number of trainable weights into specific layers. This reduces the memory footprint while maintaining the ability to teach the model new instructions.
A popular method for achieving this is Low-Rank Adaptation (LoRA). By using matrix decomposition, LoRA approximates weight updates using two smaller matrices. If the original weight matrix is W (of size d × k), the update ΔW is represented as the product of two low-rank matrices B (of size d × r) and A (of size r × k), where the rank r is far smaller than d and k. The updated weights are calculated using the following equation:

W' = W + ΔW = W + BA
This mathematical approach drastically reduces the total number of parameters you need to update during training.
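The savings are easy to verify numerically. The sketch below uses NumPy with illustrative sizes (a hypothetical 4096 × 4096 projection adapted at rank 8); it follows the common LoRA initialization in which B starts at zero, so the adapted weights initially equal the frozen ones.

```python
import numpy as np

# Illustrative sizes: a 4096 x 4096 weight matrix adapted with rank r = 8.
d, k, r = 4096, 4096, 8

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weights
A = rng.normal(size=(r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so BA starts at 0

delta_W = B @ A                      # low-rank update, same shape as W
W_adapted = W + delta_W              # W' = W + BA

full_params = d * k                  # parameters in the original matrix
lora_params = d * r + r * k          # parameters LoRA actually trains
print(full_params, lora_params)      # → 16777216 65536 (about 0.4%)
```

Here LoRA trains roughly 0.4% of the parameters of the full matrix, and because B is initialized to zero, training starts from exactly the pretrained behavior.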
In this chapter, you will learn how to implement these techniques to train models under strict memory constraints. You will examine the mathematical principles of LoRA and how it minimizes trainable parameters. You will also use Quantized LoRA (QLoRA) to load base models in 4-bit precision, further reducing memory usage with little to no loss in performance.
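As a preview of the QLoRA workflow, the sketch below loads a base model in 4-bit NF4 precision using Hugging Face `transformers` with `bitsandbytes`. This is a configuration sketch, not a runnable recipe: it assumes both libraries are installed, a CUDA GPU is available, and the `model_id` is a placeholder you would replace with a model you have access to.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder model id; substitute any causal LM you have access to.
model_id = "meta-llama/Llama-2-7b-hf"

# QLoRA-style 4-bit quantization settings for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type from QLoRA
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

The base weights stay frozen and quantized; only the LoRA adapters added on top are trained in higher precision.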
By the end of this module, you will know how to configure target modules, tune rank parameters, and write the Python code required to apply a complete PEFT configuration to a base language model.
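As a first look at what that configuration code involves, here is a minimal LoRA setup sketch using the Hugging Face `peft` library. The values are illustrative defaults, and the `target_modules` names match Llama-style attention projections; other architectures use different module names.

```python
from peft import LoraConfig, TaskType, get_peft_model

# A minimal LoRA configuration sketch; values are illustrative.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the update matrices B and A
    lora_alpha=16,                        # scaling factor applied to BA
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # architecture-specific module names
)

# Wrapping a loaded base model would then look like:
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()
```

The chapter's hands-on section walks through each of these parameters in detail.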
4.1 Understanding Full Fine-Tuning Limitations
4.2 Low-Rank Adaptation (LoRA) Principles
4.3 Quantized LoRA (QLoRA) and 4-bit Training
4.4 Configuring Target Modules and Rank
4.5 Hands-On Practical: Implementing a LoRA Configuration