A hands-on walkthrough demonstrates the practical application of LoRA for fine-tuning pre-trained models. It uses the Hugging Face `transformers` and `peft` libraries, which offer convenient abstractions for applying LoRA. The process includes loading a model, configuring LoRA, applying it, and preparing for training.

This practical assumes you have a working Python environment with `torch`, `transformers`, `peft`, and `datasets` installed:

```bash
pip install -q torch transformers datasets peft accelerate bitsandbytes
```

## 1. Setup: Loading Model and Data

First, we import the necessary libraries and load a pre-trained model and a dataset. For demonstration purposes, we'll use GPT-2, a common causal language model, and a small subset of the ELI5 dataset, which contains question-answer pairs.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from datasets import load_dataset

# Load Model and Tokenizer
model_name = "gpt2"  # Example: use 'gpt2' or another suitable causal LM
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token for batching

# Load and prepare dataset (using a small subset for demonstration)
dataset = load_dataset("eli5", split="train_asks[:1000]")  # Use a small slice
dataset = dataset.flatten()  # Expose nested fields such as 'answers.text' as top-level columns

# Basic preprocessing: tokenize the text
def preprocess_function(examples):
    # Concatenate question and answer for causal LM training
    texts = [q + " " + a[0] for q, a in zip(examples["title"], examples["answers.text"])]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(
    preprocess_function, batched=True, remove_columns=dataset.column_names
)

print("Base model loaded:", model_name)
print("Dataset loaded and tokenized.")
```

## 2. Configuring LoRA

The core of applying LoRA via the `peft` library lies in the `LoraConfig`. This object specifies how LoRA should be integrated into the base model. We need to define the rank $r$, the scaling factor $\alpha$, and which layers LoRA should target.

As discussed previously:

- `r`: The rank of the decomposition $BA$. A smaller $r$ means fewer trainable parameters but might limit adaptation capacity. Typical values range from 4 to 64.
- `lora_alpha`: The scaling factor applied to the LoRA output ($\frac{\alpha}{r} BAx$). It controls the magnitude of the adaptation relative to the original weights. A common practice is setting `lora_alpha` equal to or double the value of `r`.
- `target_modules`: A list of module names (or regex patterns) within the base model whose linear layers LoRA should augment with low-rank updates. For GPT-2, common targets are the attention projection layers (e.g., `c_attn`) or feed-forward network layers. You can inspect `base_model.named_modules()` to find suitable names (a short inspection sketch follows the configuration code below).
- `task_type`: Specifies the task objective, which tells `peft` how to wrap the model and its head. For GPT-2 fine-tuning, `TaskType.CAUSAL_LM` is appropriate.

```python
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                         # Rank r
    lora_alpha=32,                # Scaling factor alpha
    target_modules=["c_attn"],    # Apply LoRA to query, key, value projections in attention
    lora_dropout=0.05,            # Dropout probability for LoRA layers
    bias="none",                  # Do not train bias parameters
    task_type=TaskType.CAUSAL_LM, # Task type for causal language modeling
)

print("LoRA Configuration:")
print(lora_config)
```
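If you are unsure which names to pass to `target_modules`, you can enumerate the model's modules directly, as mentioned in the list above. The sketch below is an illustrative addition rather than part of the original recipe: it walks `base_model.named_modules()` and prints the `Conv1D` projection layers in GPT-2's first transformer block (the import path for `Conv1D` assumes a recent `transformers` release).

```python
# List candidate modules for LoRA injection.
# GPT-2 implements its linear projections as transformers' Conv1D layers.
from transformers.pytorch_utils import Conv1D

for name, module in base_model.named_modules():
    # Restrict to the first transformer block to keep the output short.
    if name.startswith("transformer.h.0.") and isinstance(module, Conv1D):
        print(name, "->", type(module).__name__)

# Expected names include:
#   transformer.h.0.attn.c_attn, transformer.h.0.attn.c_proj,
#   transformer.h.0.mlp.c_fc, transformer.h.0.mlp.c_proj
```

Any of these suffixes (`c_attn`, `c_proj`, `c_fc`) can be listed in `target_modules`; the configuration above targets only `c_attn`.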
## 3. Applying LoRA to the Model

With the `LoraConfig` defined, applying it to the base model is straightforward using `get_peft_model`. This function modifies the model architecture according to the configuration, freezing the original weights and inserting the trainable LoRA adapters.

Let's compare the number of trainable parameters before and after applying LoRA.

```python
# Calculate original trainable parameters
original_params = sum(p.numel() for p in base_model.parameters() if p.requires_grad)
print(f"Original trainable parameters: {original_params:,}")

# Apply LoRA configuration to the base model
lora_model = get_peft_model(base_model, lora_config)

# Calculate LoRA trainable parameters
lora_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(f"LoRA trainable parameters: {lora_params:,}")

# Calculate the reduction
reduction = (original_params - lora_params) / original_params * 100
print(f"Parameter reduction: {reduction:.2f}%")

# Print trainable modules for verification
lora_model.print_trainable_parameters()
```

You should observe a significant reduction (often >99%) in the number of trainable parameters. The output of `print_trainable_parameters` confirms that only the newly added LoRA components (`lora_A` and `lora_B`) require gradients.

```dot
digraph LoRA_Param_Comparison {
    rankdir=LR;
    node [shape=box, style=filled, fontname="sans-serif"];

    subgraph cluster_0 {
        label = "Full Fine-Tuning";
        bgcolor = "#e9ecef";
        node [fillcolor="#a5d8ff"];
        FullParams [label="All Model\nParameters\n(e.g., ~124M for GPT-2)\nTrainable"];
    }

    subgraph cluster_1 {
        label = "LoRA Fine-Tuning";
        bgcolor = "#e9ecef";
        node [fillcolor="#96f2d7"];
        BaseParams [label="Base Model\nParameters\n(e.g., ~124M for GPT-2)\nFrozen"];
        LoraAdapters [label="LoRA Adapters\n(Matrices A, B)\n(e.g., ~0.6M for r=16)\nTrainable"];
        BaseParams -> LoraAdapters [style=invis]; // Ensures vertical alignment if needed
    }

    FullParams -> BaseParams [style=invis, weight=0]; // Helper for layout
    FullParams -> LoraAdapters [label="vs", style=dashed, arrowhead=none, constraint=false, color="#495057"];
}
```

Comparison of trainable parameters in full fine-tuning versus LoRA fine-tuning. LoRA significantly reduces the parameter count by training only small adapter matrices while keeping the base model frozen.

## 4. Setting up the Training Loop

Now we can set up a standard training process using the `transformers.Trainer`. The main difference is that we pass the `lora_model` (the PEFT-modified model) instead of the `base_model`. The `Trainer` will handle the optimization automatically, updating only the trainable LoRA parameters.

```python
# Define Training Arguments
output_dir = "./lora_gpt2_eli5_results"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,              # Keep it short for demonstration
    per_device_train_batch_size=4,   # Adjust based on your GPU memory
    logging_steps=50,
    save_steps=200,
    learning_rate=2e-4,              # Typical learning rate for LoRA
    fp16=torch.cuda.is_available(),  # Use mixed precision if available
    # Add other arguments as needed: weight_decay, warmup_steps, etc.
)

# Data Collator for Causal LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize Trainer
trainer = Trainer(
    model=lora_model,  # Use the PEFT model
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

print("Trainer initialized. Starting training...")
```

## 5. Training the Model

Initiate the training process. The `Trainer` will handle the forward and backward passes, updating only the LoRA weights ($A$ and $B$); the base model weights remain unchanged.
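As an optional sanity check (an addition for illustration, not part of the original walkthrough), you can keep a copy of one frozen base-model parameter before training and compare it after `trainer.train()` returns: it should be identical, while the `lora_A`/`lora_B` tensors will have changed. A minimal sketch that picks an arbitrary frozen parameter by filtering on `requires_grad`:

```python
# Snapshot one frozen base-model parameter to verify later that training
# left the base weights untouched.
frozen_name, frozen_param = next(
    (n, p) for n, p in lora_model.named_parameters() if not p.requires_grad
)
frozen_before = frozen_param.detach().cpu().clone()
print(f"Tracking frozen parameter: {frozen_name}")

# After trainer.train() completes, this check should pass:
#   assert torch.equal(frozen_before, frozen_param.detach().cpu())
```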
```python
# Start training
trainer.train()

print("Training finished.")
```

During training, observe the logs. The loss should decrease, indicating that the LoRA adapters are learning to adapt the model's behavior to the target task (in this case, mimicking the question-answer style of ELI5).

## 6. Saving the LoRA Adapter

After training, you don't save the entire model. Instead, you save only the trained LoRA adapter weights. This is one of the primary advantages of PEFT methods like LoRA: the resulting artifact is very small.

```python
# Define path to save the adapter
adapter_path = f"{output_dir}/final_adapter"

# Save the LoRA adapter weights
lora_model.save_pretrained(adapter_path)

print(f"LoRA adapter saved to: {adapter_path}")

# You can verify the small size of the saved adapter directory
# !ls -lh {adapter_path}
```

The saved directory (`adapter_path`) will contain files like `adapter_model.bin` (the LoRA weights; newer `peft` versions save `adapter_model.safetensors` instead) and `adapter_config.json` (the LoRA configuration used). Its size will be in the megabytes range, dramatically smaller than the gigabytes required for the full base model.

## 7. Loading and Using the Trained Adapter

To use the fine-tuned model for inference, you load the original base model first and then apply the saved LoRA adapter weights on top of it.

```python
# Load the base model again (or use the one already in memory)
base_model_reloaded = AutoModelForCausalLM.from_pretrained(model_name)

# Load the PEFT model by attaching the adapter weights to the base model
inference_model = PeftModel.from_pretrained(base_model_reloaded, adapter_path)

# Ensure the model is in evaluation mode and on the correct device
inference_model.eval()
if torch.cuda.is_available():
    inference_model.to("cuda")

print("Base model loaded and LoRA adapter applied for inference.")

# Example Inference (Optional)
prompt = "What is the main cause of climate change?"
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate text
with torch.no_grad():
    outputs = inference_model.generate(
        **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Example Generation ---")
print("Prompt:", prompt)
print("Generated:", generated_text)
print("------------------------")
```

This completes a basic cycle of applying LoRA: configuring, wrapping the model, training, saving the adapter, and loading it for use. You've effectively fine-tuned a large language model by training only a tiny fraction of its parameters, demonstrating the efficiency of LoRA. The following sections and chapters will build upon this foundation, exploring more advanced configurations, variants like QLoRA, and evaluation techniques.
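One optional deployment detail worth knowing before moving on: `peft` can also fold the low-rank update back into the base weights, materializing $W' = W + \frac{\alpha}{r} BA$ once so that inference carries no adapter overhead. A minimal sketch, assuming the installed `peft` version exposes `merge_and_unload` (recent releases do); the output directory name is just an illustrative choice:

```python
# Merge the LoRA update into the base weights and drop the adapter wrappers.
# The result is a plain transformers model that can be saved or served as usual.
merged_model = inference_model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}/merged_model")
tokenizer.save_pretrained(f"{output_dir}/merged_model")
```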