Now that we have explored Adapter Tuning in theory, let's put these concepts into practice. This section provides a hands-on guide to fine-tuning a pre-trained Transformer model using adapter modules. We will use the adapter-transformers library, an extension of Hugging Face's transformers specifically designed to streamline the use of adapters and other PEFT methods.
Our goal is to adapt bert-base-uncased for a sentiment classification task (using the GLUE SST-2 dataset) by training only lightweight adapter modules inserted into the model. This approach keeps the vast majority of the original model parameters frozen, significantly reducing computational and storage requirements compared to full fine-tuning.
First, ensure you have the necessary libraries installed. We'll need adapter-transformers (which ships its own extended version of transformers), datasets for data handling, and evaluate for metrics; PyTorch (torch) must also be available as the backend.
pip install -U adapter-transformers datasets evaluate
We start by loading the pre-trained model and tokenizer, just as you would with the standard transformers library. We'll also load the Stanford Sentiment Treebank (SST-2) dataset.
# adapter-transformers installs as a drop-in replacement for the transformers package,
# so AutoAdapterModel is imported from the transformers namespace.
from transformers import AutoTokenizer, AutoAdapterModel
from datasets import load_dataset
# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoAdapterModel.from_pretrained(model_name) # Key change: Use AutoAdapterModel
# Load dataset
dataset = load_dataset("glue", "sst2")
# Preprocess data
def encode_batch(batch):
    """Tokenizes the sentences."""
    return tokenizer(batch["sentence"], max_length=80, truncation=True, padding="max_length")
dataset = dataset.map(encode_batch, batched=True)
dataset = dataset.rename_column("label", "labels") # Rename label column for Trainer compatibility
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print("Dataset sample:", dataset["train"][0])
Using AutoAdapterModel instead of AutoModelForSequenceClassification is important, as it provides the methods for managing adapters (adding, activating, saving, and loading them).
Now, we add adapter modules to the loaded BERT model. The adapter-transformers library makes this straightforward. We'll add bottleneck adapters (the classic adapter type) to each layer of the Transformer model.
# AdapterConfig is provided by the adapter-transformers fork of transformers
from transformers import AdapterConfig
# Configure the adapter
# Using Pfeiffer config: bottleneck adapter with reduction factor 16
adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=16)
# Add adapter to the model
# Give it a unique name, e.g., "sentiment_adapter"
adapter_name = "sentiment_adapter"
model.add_adapter(adapter_name, config=adapter_config)
# Add a classification head for our task associated with this adapter
num_labels = dataset["train"].features["labels"].num_classes
model.add_classification_head(
adapter_name,
num_labels=num_labels,
id2label={ 0: "NEGATIVE", 1: "POSITIVE" } # Optional label mapping
)
# Activate the adapter for training
model.train_adapter(adapter_name)
# Verify which parameters are trainable
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_params}")
print(f"Trainable Parameters (Adapter + Head): {trainable_params}")
print(f"Trainable %: {100 * trainable_params / total_params:.4f}%")
Observe the output. You'll notice that trainable_params is only a small fraction of total_params. The add_adapter method inserts the adapter modules into each Transformer layer (with the Pfeiffer configuration, after the feed-forward sub-layer), and add_classification_head adds a new task-specific output layer associated with the adapter. Crucially, train_adapter(adapter_name) freezes the entire pre-trained BERT model and unfreezes only the parameters belonging to the specified adapter (sentiment_adapter) and its associated classification head. This selective freezing and unfreezing is the core mechanism that enables parameter-efficient training.
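If you want to see the freezing more concretely, you can inspect which parameter tensors actually require gradients. A minimal sketch, assuming the model and adapter were set up as in the code above:
# Inspect which parameter tensors are trainable after train_adapter().
trainable_names = [name for name, p in model.named_parameters() if p.requires_grad]
frozen_names = [name for name, p in model.named_parameters() if not p.requires_grad]

print(f"Trainable tensors: {len(trainable_names)}, frozen tensors: {len(frozen_names)}")
# The trainable names should reference the adapter ("sentiment_adapter") or the prediction head.
for name in trainable_names[:5]:
    print("  ", name)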
The AdapterConfig.load("pfeiffer", reduction_factor=16) call specifies the adapter architecture. "pfeiffer" refers to the bottleneck adapter configuration from Pfeiffer et al.: a down-projection, non-linearity, and up-projection with a residual connection, inserted once per Transformer layer. The reduction_factor=16 means the bottleneck dimension is d_model / 16, where d_model is the hidden dimension of the model (768 for bert-base-uncased, giving a bottleneck of 48). Adjusting this factor directly controls the trade-off between parameter count and potential performance.
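As a quick sanity check on these numbers, you can estimate the adapter size per layer yourself. The sketch below assumes one bottleneck (down-projection plus up-projection, with biases) per Transformer layer, as in the Pfeiffer configuration; the exact count reported by the library may differ slightly once head parameters are included.
# Back-of-the-envelope estimate of adapter parameters for bert-base-uncased.
d_model = 768                                # hidden size of bert-base-uncased
reduction_factor = 16
d_bottleneck = d_model // reduction_factor   # 768 / 16 = 48

# One down-projection (768 x 48 + bias) and one up-projection (48 x 768 + bias) per layer
params_per_layer = (d_model * d_bottleneck + d_bottleneck) + (d_bottleneck * d_model + d_model)
num_layers = 12                              # Transformer layers in BERT-base

print(f"Bottleneck dimension: {d_bottleneck}")
print(f"Approx. adapter parameters per layer: {params_per_layer:,}")
print(f"Approx. adapter parameters in total:  {params_per_layer * num_layers:,}")  # ~0.9M, vs. ~110M in BERT-base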
We use the standard Hugging Face Trainer for the training process. The setup is similar to full fine-tuning; because train_adapter has already frozen the base model, the Trainer will only compute updates for the adapter and head parameters. (The library also provides a dedicated AdapterTrainer class, but the standard Trainer works for this example.)
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer, EvalPrediction
# Define training arguments
# Note: Smaller batch size & fewer epochs suitable for demonstration
training_args = TrainingArguments(
output_dir="./adapter_sst2_output",
learning_rate=1e-4, # Adapters often benefit from slightly higher LR
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
logging_steps=100,
evaluation_strategy="epoch",
save_strategy="epoch", # Save a checkpoint at the end of each epoch
load_best_model_at_end=True,
metric_for_best_model="accuracy",
remove_unused_columns=False, # Important for adapter trainer
)
# Define evaluation metric (SST-2 is scored with accuracy)
metric = evaluate.load("glue", "sst2")
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return metric.compute(predictions=preds, references=p.label_ids)
# Instantiate Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
Note that the learning_rate is slightly higher (1e-4) than typical full fine-tuning values (around 2e-5), as adapters sometimes converge better with a larger step size. Setting remove_unused_columns=False is often needed when working with adapter models within the Trainer.
Now, we can start the training process. During this phase, gradients will only be computed and applied to the adapter and classification head weights.
# Start training
train_result = trainer.train()
# Log training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
# Evaluate the best model
eval_metrics = trainer.evaluate(eval_dataset=dataset["validation"])
trainer.log_metrics("eval", eval_metrics)
trainer.save_metrics("eval", eval_metrics)
Monitor the training progress and evaluation metrics (accuracy in this case) printed during and after training. You should see the model learning the sentiment classification task effectively, despite only updating a small subset of parameters.
A significant advantage of Adapter Tuning is the ability to save the adapter weights independently. The base model remains unchanged.
# Define path to save the adapter
output_adapter_dir = "./saved_adapters/sst2_adapter"
# Save the adapter weights
model.save_adapter(output_adapter_dir, adapter_name)
# By default, save_adapter also stores the matching prediction head alongside the adapter weights.
# model.save_head(output_adapter_dir, adapter_name)  # only needed if saving the head separately
print(f"Adapter '{adapter_name}' saved to {output_adapter_dir}")
Navigate to the output_adapter_dir. You will find a configuration file (adapter_config.json) and the weight file (e.g., pytorch_adapter.bin). Notice how small these files are, typically only a few megabytes, compared to the full model checkpoint (hundreds of megabytes for BERT-base). This demonstrates the storage efficiency of adapters.
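To check the storage footprint programmatically, you can list the saved files and their sizes; a small sketch, assuming the adapter was saved to output_adapter_dir as above:
import os

# List the files written by save_adapter and report their sizes in megabytes.
for filename in sorted(os.listdir(output_adapter_dir)):
    path = os.path.join(output_adapter_dir, filename)
    print(f"{filename}: {os.path.getsize(path) / (1024 * 1024):.2f} MB")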
To use the trained adapter, you load the original base model and then load the specific adapter weights.
import torch
# AutoAdapterModel again comes from the adapter-transformers fork of transformers
from transformers import AutoAdapterModel, AutoTokenizer, TextClassificationPipeline
# Load the base model again (imagine this is a fresh session)
inference_model = AutoAdapterModel.from_pretrained(model_name)
inference_tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the adapter weights from the saved directory
loaded_adapter_name = inference_model.load_adapter(output_adapter_dir) # Returns the name it was saved under
# IMPORTANT: Set the active adapter for inference
inference_model.set_active_adapters(loaded_adapter_name)
# You might need to explicitly load the head if it wasn't saved with the adapter
# or if you saved it separately. Often loading the adapter loads the associated head.
# Check adapter-transformers documentation for specifics on head loading if issues arise.
# Perform inference using a pipeline (GPU 0 if available, otherwise CPU)
device = 0 if torch.cuda.is_available() else -1
classifier = TextClassificationPipeline(model=inference_model, tokenizer=inference_tokenizer, device=device)
# Example sentences
sentences = [
"This movie was absolutely fantastic!",
"I was completely bored throughout the entire film.",
"The acting was decent, but the plot was predictable."
]
results = classifier(sentences)
for sentence, result in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {result['label']}, Score: {result['score']:.4f}\n")
This demonstrates the modularity of adapters: the large base model can be loaded once, and different lightweight adapters can be loaded on top to switch between tasks without needing multiple copies of the full model. set_active_adapters tells the model which adapter(s) to use for the forward pass during inference.
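For instance, a second adapter trained for another task could live alongside this one on the same base model. The sketch below is illustrative only: "./saved_adapters/mrpc_adapter" is a hypothetical path, not produced by the code in this guide.
# Hypothetical second adapter for a different task (path is illustrative only).
# mrpc_adapter_name = inference_model.load_adapter("./saved_adapters/mrpc_adapter")

# Switch tasks by changing the active adapter; the frozen base weights are shared.
inference_model.set_active_adapters(loaded_adapter_name)    # sentiment classification (SST-2)
# inference_model.set_active_adapters(mrpc_adapter_name)    # hypothetical second task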
This practical exercise illustrates the core workflow of Adapter Tuning: adding adapter modules, freezing the base model, training only the adapters, saving them separately, and loading them for efficient inference. You've successfully fine-tuned a large pre-trained Transformer for a specific task while modifying only a tiny fraction of its parameters, showcasing the efficiency and modularity benefits of this PEFT technique.