Having explored the theoretical underpinnings of PEFT techniques like LoRA, QLoRA, and Adapter modules, let's turn our attention to practical application. The Hugging Face peft library provides a standardized and user-friendly interface for integrating these methods with models from the transformers ecosystem, and it significantly simplifies the process of adapting large models efficiently.
The central idea behind peft is to wrap a standard transformers model with a PeftModel object. This wrapper manages the modification of the base model according to a specified PEFT configuration, ensuring that only the designated adapter parameters are trained.
PeftConfig and get_peft_model
At the heart of the library are configuration objects, subclasses of PeftConfig, which define the specific PEFT method and its hyperparameters. For instance, LoraConfig is used for LoRA and QLoRA, PromptTuningConfig for Prompt Tuning, and so on.
You typically start by loading your base pre-trained model using transformers. Then you define a configuration object (e.g., LoraConfig) detailing the PEFT strategy. Finally, the get_peft_model function takes the base model and the configuration and returns the PEFT-enabled model, ready for training.
Workflow for creating a PEFT model using the Hugging Face peft library.
Let's illustrate how to configure a model for LoRA fine-tuning. We'll load a base model and apply LoRA modifications to its attention layers.
# Assume necessary imports:
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# 1. Load Base Model (Example: using a smaller model for demonstration)
model_name = "meta-llama/Llama-2-7b-hf"  # Or any other compatible model
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Define LoRA Configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the update matrices. Lower rank means fewer parameters.
    lora_alpha=32,                        # LoRA scaling factor.
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to query and value projections
    lora_dropout=0.05,                    # Dropout probability for LoRA layers.
    bias="none",                          # Typically set to 'none' for LoRA.
    task_type=TaskType.CAUSAL_LM          # Specify task type (e.g., CAUSAL_LM, SEQ_2_SEQ_LM)
)

# 3. Create PeftModel
lora_model = get_peft_model(base_model, lora_config)

# Optional: Print trainable parameters to verify efficiency
lora_model.print_trainable_parameters()
# Example output (approximate, for Llama-2-7B with this config):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
In this configuration:

- r: Defines the rank of the low-rank update matrices (W_A and W_B). A common starting point is 8, 16, or 32. Higher ranks increase the number of trainable parameters but may offer better performance up to a point.
- lora_alpha: Acts as a scaling factor for the LoRA activations, typically set to twice the rank r. It helps balance the influence of the original weights and the LoRA updates.
- target_modules: Specifies which layers within the base model should receive the LoRA matrices. Identifying the optimal modules (often attention mechanism layers like the query, key, and value projections) is important for performance. You can inspect the base model's architecture (print(base_model)) to find layer names; a short inspection sketch follows this list.
- task_type: Informs peft about the model's objective, ensuring adapters are applied correctly (e.g., for causal language modeling).

Notice the dramatic reduction in trainable parameters compared to the total number of parameters in the base model. This highlights the efficiency gain of LoRA.
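If you are unsure which module names to target, you can enumerate the linear layers of the loaded model. The snippet below is a minimal sketch using standard PyTorch introspection; the exact names it prints depend on the model architecture.

import torch.nn as nn

# Collect the unique names of all linear submodules; these are candidate target_modules values.
linear_layer_names = set()
for name, module in base_model.named_modules():
    if isinstance(module, nn.Linear):
        # Keep only the final component of the dotted path, e.g. "q_proj"
        linear_layer_names.add(name.split(".")[-1])
print(sorted(linear_layer_names))
# For a Llama-style model this typically includes q_proj, k_proj, v_proj, o_proj,
# gate_proj, up_proj, down_proj, and lm_head (the output head is usually excluded from LoRA).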
QLoRA builds upon LoRA by applying it to a quantized base model, usually loaded in 4-bit precision. This further reduces memory requirements, making it feasible to fine-tune even larger models on consumer-grade hardware.
The peft library integrates seamlessly with the bitsandbytes library for quantization.
# Assume necessary imports as before

# 1. Configure Quantization (using bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Use NF4 (Normalized Float 4) for better precision
    bnb_4bit_compute_dtype="bfloat16",  # Or float16; compute dtype during the forward pass
    bnb_4bit_use_double_quant=True,     # Optional: Use double quantization
)

# 2. Load Base Model with Quantization
model_name = "meta-llama/Llama-2-7b-hf"  # Example model
base_model_quantized = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"  # Automatically distribute model layers across devices if needed
)

# Prepare model for k-bit training (gradient checkpointing, input/output embeddings)
base_model_quantized = prepare_model_for_kbit_training(base_model_quantized)

# 3. Define LoRA Configuration (similar to before)
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Identify target modules in the specific model
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# 4. Create PeftModel
qlora_model = get_peft_model(base_model_quantized, qlora_config)

# Optional: Print trainable parameters
qlora_model.print_trainable_parameters()
# Example output might be similar or slightly different based on quantization details
The primary difference lies in loading the base model with the quantization_config argument. The prepare_model_for_kbit_training utility function performs necessary setup steps, such as enabling gradient checkpointing to save memory during training and ensuring the input/output embeddings are compatible with the quantized model. The LoraConfig itself remains largely the same, but it is now applied over the quantized weights.
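As a quick sanity check that quantization took effect, you can inspect the model's reported memory footprint after loading in 4-bit; get_memory_footprint() is a standard transformers utility, and the figures in the comments below are only rough, illustrative estimates.

# Rough footprint check (illustrative numbers; actual values depend on the model and setup).
footprint_gb = base_model_quantized.get_memory_footprint() / 1e9
print(f"Quantized base model footprint: {footprint_gb:.1f} GB")
# A 7B model in fp16 needs roughly 14 GB for weights alone; in 4-bit NF4 this drops to roughly 4 GB.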
While LoRA and QLoRA are widely used, peft supports other techniques as well:

- AdaptionPromptConfig: requires specifying the adapter layers and dimensions.
- PromptTuningConfig: specify the number of virtual tokens to be tuned.
- PrefixTuningConfig: similar to Prompt Tuning, but adds tunable prefixes to the hidden states.

The general workflow remains consistent: define the appropriate configuration object and pass it, along with the base model, to get_peft_model. Consult the official peft documentation for the specific parameters required for each configuration type.
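For example, a prompt-tuning setup differs from the LoRA example above only in the configuration object. This is a minimal sketch; the choice of 20 virtual tokens is an arbitrary illustrative value.

from peft import PromptTuningConfig, TaskType, get_peft_model

# Prompt Tuning: learn a small set of virtual token embeddings prepended to every input.
# Note: use a freshly loaded base model here, not one already wrapped by get_peft_model above.
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
)
prompt_model = get_peft_model(base_model, prompt_config)
prompt_model.print_trainable_parameters()
# Only the virtual token embeddings (num_virtual_tokens x hidden_size) are trainable.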
transformers.Trainer
A significant advantage of peft is its smooth integration with the standard transformers.Trainer API. Once you have created your PeftModel (e.g., lora_model or qlora_model), you can pass it directly to the Trainer, just as you would a regular model.
# Assume imports: from transformers import Trainer, TrainingArguments
# Assume 'train_dataset' and 'eval_dataset' are prepared

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    evaluation_strategy="steps",
    eval_steps=50,
    # Add other arguments as needed (fp16, optim, etc.)
)

# Initialize Trainer with the PeftModel
trainer = Trainer(
    model=qlora_model,  # Pass the PEFT model here
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Add data collator, tokenizer if needed
)

# Start Training
trainer.train()
The Trainer automatically detects that it is working with a PeftModel and ensures that only the adapter parameters (e.g., the LoRA matrices) are updated during backpropagation. Gradients for the frozen base model parameters are not computed or applied, leading to the expected computational and memory savings.
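If you want to confirm this yourself, you can count which parameters actually require gradients before training; this is a small sketch using standard PyTorch attributes.

# Count trainable vs. frozen parameters directly from the model.
trainable, frozen = 0, 0
for name, param in qlora_model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()  # Only the adapter parameters (names containing "lora_") end up here
    else:
        frozen += param.numel()
print(f"trainable: {trainable:,} | frozen: {frozen:,}")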
After training, you don't save the entire model (which would defeat the purpose of PEFT). Instead, you save only the trained adapter weights.
# Save adapters after training
adapter_path = "./qlora_adapters"
qlora_model.save_pretrained(adapter_path)
# Optionally save the tokenizer too
# tokenizer.save_pretrained(adapter_path)
This saves the adapter configuration (adapter_config.json) and the trained weights (adapter_model.bin or .safetensors) to the specified directory. The base model is not saved here, resulting in very small checkpoint sizes.
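To see just how small the checkpoint is, you can sum the file sizes in the adapter directory; for a LoRA adapter like the one above this is typically tens of megabytes rather than the multiple gigabytes of the full model. A minimal sketch:

import os

# Total on-disk size of the saved adapter (config + weights only).
adapter_bytes = sum(
    os.path.getsize(os.path.join(adapter_path, f))
    for f in os.listdir(adapter_path)
    if os.path.isfile(os.path.join(adapter_path, f))
)
print(f"Adapter checkpoint size: {adapter_bytes / 1e6:.1f} MB")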
To load the adapters for inference or further training, you first load the original base model (potentially with the same quantization configuration) and then apply the saved adapters:
# Assume imports and base model loading (quantized or not)
# from peft import PeftModel, PeftConfig
# Load the base model first (same configuration as during training)
# Example: base_model_for_inference = AutoModelForCausalLM.from_pretrained(...)
# Load adapters onto the base model
inference_model = PeftModel.from_pretrained(base_model_for_inference, adapter_path)
# Merge adapters (Optional, for deployment)
# merged_model = inference_model.merge_and_unload()
# Now 'merged_model' is a standard transformers model with weights updated
# Note: Merging may not be supported or straightforward with quantized models (e.g., 4-bit)
The PeftModel.from_pretrained function loads the adapter configuration and weights and applies them to the provided base model.
For deployment scenarios where dynamic adapter switching isn't needed, you might merge the adapter weights directly into the base model's weights using model.merge_and_unload(). This creates a standard transformers model with the combined weights, potentially simplifying the inference stack and offering a slight speedup, since the adapter logic is removed. However, it increases the model size back to that of the base model and removes the ability to easily swap adapters. Merging is generally simpler with non-quantized models; merging adapters into quantized models (especially 4-bit) can be complex or may require specific library support.
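A typical merge-for-deployment flow is sketched below. It assumes the base model is reloaded in a non-quantized dtype (here fp16) before the trained adapters are applied and merged; the output path and dtype are illustrative choices.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model without quantization so the LoRA weights can be folded in cleanly.
base_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base_fp16, adapter_path).merge_and_unload()

# 'merged' is now a plain transformers model; save it for a standard inference stack.
merged.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")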
A few practical considerations when applying these methods:

- target_modules: The choice of which layers to adapt (e.g., via target_modules in LoraConfig) significantly impacts performance. Common choices for transformers include the attention projection layers (q_proj, k_proj, v_proj, o_proj) and sometimes the feed-forward network layers (gate_proj, up_proj, down_proj). Experimentation might be needed. For some models, specifying "all-linear" might be a convenient starting point if memory permits; a short sketch follows this list.
- Hyperparameters: PEFT methods introduce their own hyperparameters (e.g., r, lora_alpha, num_virtual_tokens). These need tuning alongside standard training hyperparameters like learning rate and batch size.
- Dependencies: The workflow relies on peft, transformers, accelerate, and potentially bitsandbytes. Ensure you have compatible versions installed to avoid runtime errors. Check the documentation of each library for version requirements.
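As an illustration of the point about module selection, recent peft releases accept the special string "all-linear" for target_modules, which targets every linear layer except the output head; the sketch below assumes such a version is installed.

# LoRA over all linear layers (except the LM head); more trainable params than q_proj/v_proj only.
broad_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # Special value supported by recent peft releases
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)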
The Hugging Face peft library provides a powerful and flexible framework for implementing parameter-efficient fine-tuning. By understanding its core components and workflow, you can effectively adapt large language models for specific tasks while managing computational resources efficiently. The hands-on sections later in this chapter will provide concrete examples of applying LoRA and QLoRA using this library.