Having explored the theoretical underpinnings of PEFT techniques like LoRA, QLoRA, and Adapter modules, let's turn our attention to practical application. The Hugging Face peft library provides a standardized and user-friendly interface for integrating these methods with models from the transformers ecosystem, and it significantly simplifies the process of adapting large models efficiently.
The central idea behind peft is to wrap a standard transformers model with a PeftModel object. This wrapper manages the modification of the base model according to a specified PEFT configuration, ensuring that only the designated adapter parameters are trained.
PeftConfig and get_peft_model
At the heart of the library are configuration objects, subclasses of PeftConfig, which define the specific PEFT method and its hyperparameters. For instance, LoraConfig is used for LoRA and QLoRA, PromptTuningConfig for Prompt Tuning, and so on.
You typically start by loading your base pre-trained model using transformers. Then you define a configuration object (e.g., LoraConfig) detailing the PEFT strategy. Finally, the get_peft_model function takes the base model and the configuration and returns the PEFT-enabled model, ready for training.
Workflow for creating a PEFT model using the Hugging Face peft library.
Let's illustrate how to configure a model for LoRA fine-tuning. We'll load a base model and apply LoRA modifications to its attention layers.
# Assume necessary imports:
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# 1. Load Base Model (Example: using a smaller model for demonstration)
model_name = "meta-llama/Llama-2-7b-hf"  # Or any other compatible model
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Define LoRA Configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices. Lower rank means fewer parameters.
    lora_alpha=32,                        # LoRA scaling factor.
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to query and value projections
    lora_dropout=0.05,                    # Dropout probability for LoRA layers.
    bias="none",                          # Typically set to 'none' for LoRA.
    task_type=TaskType.CAUSAL_LM          # Specify task type (e.g., CAUSAL_LM, SEQ_2_SEQ_LM)
)

# 3. Create PeftModel
lora_model = get_peft_model(base_model, lora_config)

# Optional: Print trainable parameters to verify efficiency
lora_model.print_trainable_parameters()
# Example output (approximate, for Llama-2-7B with this config):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
In this configuration:

- r: Defines the rank of the low-rank update matrices (W_A and W_B). A common starting point is 8, 16, or 32. Higher ranks increase the number of trainable parameters but may offer better performance up to a point.
- lora_alpha: Acts as a scaling factor for the LoRA activations, typically set to twice the rank r. It helps balance the influence of the original weights and the LoRA updates.
- target_modules: Specifies which layers within the base model should receive the LoRA matrices. Identifying the optimal modules (often attention mechanism layers like the query, key, and value projections) is important for performance. You can inspect the base model's architecture (print(base_model)) to find layer names; a short inspection sketch follows this list.
- task_type: Informs peft about the model's objective, ensuring adapters are applied correctly (e.g., for causal language modeling).

Notice the dramatic reduction in trainable parameters compared to the total number of parameters in the base model. This highlights the efficiency gain of LoRA.
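If you are unsure which module names to target, you can enumerate the linear layers of the loaded model. The snippet below is a minimal sketch using standard PyTorch introspection; the exact names it prints depend on the model architecture.

import torch.nn as nn

# Collect the unique names of all linear submodules; these are candidate target_modules values.
linear_layer_names = set()
for name, module in base_model.named_modules():
    if isinstance(module, nn.Linear):
        # Keep only the final component of the dotted path, e.g. "q_proj"
        linear_layer_names.add(name.split(".")[-1])
print(sorted(linear_layer_names))
# For a Llama-style model this typically includes q_proj, k_proj, v_proj, o_proj,
# gate_proj, up_proj, down_proj, and lm_head (the output head is usually excluded from LoRA).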
QLoRA builds upon LoRA by applying it to a quantized base model, usually loaded in 4-bit precision. This further reduces memory requirements, making it feasible to fine-tune even larger models on consumer-grade hardware.
The peft library integrates seamlessly with the bitsandbytes library for quantization.
# Assume necessary imports as before

# 1. Configure Quantization (using bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Use NF4 (Normalized Float 4) for better precision
    bnb_4bit_compute_dtype="bfloat16",  # Or float16; compute dtype during the forward pass
    bnb_4bit_use_double_quant=True,     # Optional: Use double quantization
)

# 2. Load Base Model with Quantization
model_name = "meta-llama/Llama-2-7b-hf"  # Example model
base_model_quantized = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"  # Automatically distribute model layers across devices if needed
)

# Prepare model for k-bit training (gradient checkpointing, input/output embeddings)
base_model_quantized = prepare_model_for_kbit_training(base_model_quantized)

# 3. Define LoRA Configuration (similar to before)
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Identify target modules in the specific model
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# 4. Create PeftModel
qlora_model = get_peft_model(base_model_quantized, qlora_config)

# Optional: Print trainable parameters
qlora_model.print_trainable_parameters()
# Example output might be similar or slightly different based on quantization details
The primary difference lies in loading the base model with the quantization_config argument. The prepare_model_for_kbit_training utility function performs necessary setup steps, such as enabling gradient checkpointing to save memory during training and ensuring the input/output embeddings are compatible with the quantized model. The LoraConfig itself remains largely the same, but it is now applied over the quantized weights.
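As a quick sanity check that quantization took effect, you can inspect the model's reported memory footprint after loading in 4-bit; get_memory_footprint() is a standard transformers utility, and the figures in the comments below are only rough, illustrative estimates.

# Rough footprint check (illustrative numbers; actual values depend on the model and setup).
footprint_gb = base_model_quantized.get_memory_footprint() / 1e9
print(f"Quantized base model footprint: {footprint_gb:.1f} GB")
# A 7B model in fp16 needs roughly 14 GB for weights alone; in 4-bit NF4 this drops to roughly 4 GB.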
While LoRA and QLoRA are widely used, peft supports other techniques as well:

- AdaptionPromptConfig: requires specifying the adapter layers and dimensions.
- PromptTuningConfig: specify the number of virtual tokens to be tuned.
- PrefixTuningConfig: similar to Prompt Tuning, but adds tunable prefixes to the hidden states.

The general workflow remains consistent: define the appropriate configuration object and pass it, along with the base model, to get_peft_model. Consult the official peft documentation for the specific parameters required for each configuration type.
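For example, a prompt-tuning setup differs from the LoRA example above only in the configuration object. This is a minimal sketch; the choice of 20 virtual tokens is an arbitrary illustrative value.

from peft import PromptTuningConfig, TaskType, get_peft_model

# Prompt Tuning: learn a small set of virtual token embeddings prepended to every input.
# Note: use a freshly loaded base model here, not one already wrapped by get_peft_model above.
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
)
prompt_model = get_peft_model(base_model, prompt_config)
prompt_model.print_trainable_parameters()
# Only the virtual token embeddings (num_virtual_tokens x hidden_size) are trainable.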
transformers.Trainer
A significant advantage of peft is its smooth integration with the standard transformers.Trainer API. Once you have created your PeftModel (e.g., lora_model or qlora_model), you can pass it directly to the Trainer, just as you would a regular model.
# Assume imports: from transformers import Trainer, TrainingArguments
# Assume 'train_dataset' and 'eval_dataset' are prepared

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    evaluation_strategy="steps",
    eval_steps=50,
    # Add other arguments as needed (fp16, optim, etc.)
)

# Initialize Trainer with the PeftModel
trainer = Trainer(
    model=qlora_model,  # Pass the PEFT model here
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Add data collator, tokenizer if needed
)

# Start Training
trainer.train()
The Trainer automatically detects that it is working with a PeftModel and ensures that only the adapter parameters (e.g., the LoRA matrices) are updated during backpropagation. Gradients for the frozen base model parameters are not computed or applied, leading to the expected computational and memory savings.
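If you want to confirm this yourself, you can count which parameters actually require gradients before training; this is a small sketch using standard PyTorch attributes.

# Count trainable vs. frozen parameters directly from the model.
trainable, frozen = 0, 0
for name, param in qlora_model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()  # Only the adapter parameters (names containing "lora_") end up here
    else:
        frozen += param.numel()
print(f"trainable: {trainable:,} | frozen: {frozen:,}")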
After training, you don't save the entire model (which would defeat the purpose of PEFT). Instead, you save only the trained adapter weights.
# Save adapters after training
adapter_path = "./qlora_adapters"
qlora_model.save_pretrained(adapter_path)
# Optionally save the tokenizer too
# tokenizer.save_pretrained(adapter_path)
This saves the adapter configuration (adapter_config.json) and the trained weights (adapter_model.bin or .safetensors) to the specified directory. The base model is not saved here, resulting in very small checkpoint sizes.
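To see just how small the checkpoint is, you can sum the file sizes in the adapter directory; for a LoRA adapter like the one above this is typically tens of megabytes rather than the multiple gigabytes of the full model. A minimal sketch:

import os

# Total on-disk size of the saved adapter (config + weights only).
adapter_bytes = sum(
    os.path.getsize(os.path.join(adapter_path, f))
    for f in os.listdir(adapter_path)
    if os.path.isfile(os.path.join(adapter_path, f))
)
print(f"Adapter checkpoint size: {adapter_bytes / 1e6:.1f} MB")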
To load the adapters for inference or further training, you first load the original base model (potentially with the same quantization configuration) and then apply the saved adapters:
# Assume imports and base model loading (quantized or not)
# from peft import PeftModel, PeftConfig
# Load the base model first (same configuration as during training)
# Example: base_model_for_inference = AutoModelForCausalLM.from_pretrained(...)
# Load adapters onto the base model
inference_model = PeftModel.from_pretrained(base_model_for_inference, adapter_path)
# Merge adapters (Optional, for deployment)
# merged_model = inference_model.merge_and_unload()
# Now 'merged_model' is a standard transformers model with weights updated
# Note: Merging may not be supported or straightforward with quantized models (e.g., 4-bit)
The PeftModel.from_pretrained function loads the adapter configuration and weights and applies them to the provided base model.
For deployment scenarios where dynamic adapter switching isn't needed, you might merge the adapter weights directly into the base model's weights using model.merge_and_unload(). This creates a standard transformers model with the combined weights, potentially simplifying the inference stack and offering a slight speedup, since the adapter logic is removed. However, it increases the model size back to that of the base model and removes the ability to easily swap adapters. Merging is generally simpler with non-quantized models; merging adapters into quantized models (especially 4-bit) can be complex or may require specific library support.
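A typical merge-for-deployment flow is sketched below. It assumes the base model is reloaded in a non-quantized dtype (here fp16) before the trained adapters are applied and merged; the output path and dtype are illustrative choices.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model without quantization so the LoRA weights can be folded in cleanly.
base_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base_fp16, adapter_path).merge_and_unload()

# 'merged' is now a plain transformers model; save it for a standard inference stack.
merged.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")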
A few practical considerations when applying these methods:

- target_modules: The choice of which layers to adapt (e.g., via target_modules in LoraConfig) significantly impacts performance. Common choices for transformers include the attention projection layers (q_proj, k_proj, v_proj, o_proj) and sometimes the feed-forward network layers (gate_proj, up_proj, down_proj). Experimentation might be needed. For some models, specifying "all-linear" might be a convenient starting point if memory permits; a short sketch follows this list.
- Hyperparameters: PEFT methods introduce their own hyperparameters (e.g., r, lora_alpha, num_virtual_tokens). These need tuning alongside standard training hyperparameters like learning rate and batch size.
- Dependencies: The workflow relies on peft, transformers, accelerate, and potentially bitsandbytes. Ensure you have compatible versions installed to avoid runtime errors. Check the documentation of each library for version requirements.
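As an illustration of the point about module selection, recent peft releases accept the special string "all-linear" for target_modules, which targets every linear layer except the output head; the sketch below assumes such a version is installed.

# LoRA over all linear layers (except the LM head); more trainable params than q_proj/v_proj only.
broad_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # Special value supported by recent peft releases
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)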
The Hugging Face peft library provides a powerful and flexible framework for implementing parameter-efficient fine-tuning. By understanding its core components and workflow, you can effectively adapt large language models for specific tasks while managing computational resources efficiently. The hands-on sections later in this chapter will provide concrete examples of applying LoRA and QLoRA using this library.