As discussed in the chapter introduction, fine-tuning enormous pre-trained Transformer-based language models for specific downstream tasks presents significant computational and storage challenges. Retraining the entire model, potentially billions of parameters, for each new task is often impractical. Parameter-Efficient Fine-Tuning (PEFT) methods address this by adapting the model using only a small number of additional or modified parameters. Adapter modules represent one of the earliest and most influential PEFT techniques.
The central idea behind adapters is to inject small, trainable modules into the existing architecture of a pre-trained Transformer while keeping the original weights frozen. During fine-tuning, only the parameters of these newly added adapter modules are updated. This dramatically reduces the number of trainable parameters, often by several orders of magnitude, compared to full fine-tuning.
An adapter module typically consists of a bottleneck structure designed to project the input dimension down to a much smaller intermediate dimension and then project it back up to the original dimension. This structure includes:

- A down-projection linear layer that maps the input from the model dimension to the bottleneck dimension.
- A non-linear activation function (such as GELU or ReLU) applied to the bottleneck representation.
- An up-projection linear layer that maps the bottleneck representation back to the original model dimension.
- A residual connection that adds the adapter's output to its input.
The bottleneck dimension (the size of the intermediate layer) is a critical hyperparameter. It controls the number of parameters in the adapter and influences the trade-off between parameter efficiency and task performance. Smaller bottleneck dimensions lead to fewer parameters but might limit the adapter's capacity to learn task-specific features.
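For a sense of scale, here is a quick count of the parameters added by a single adapter, using an assumed model dimension of 768 and a bottleneck of 64 (both values are illustrative, not taken from any specific model):

# Illustrative parameter count for one adapter (values are assumptions)
d_model = 768        # hidden size of a hypothetical base model
bottleneck_dim = 64  # chosen bottleneck size

down_params = d_model * bottleneck_dim + bottleneck_dim  # weights + bias
up_params = bottleneck_dim * d_model + d_model           # weights + bias
print(down_params + up_params)  # 99,136 parameters per adapter

With two adapters in each of, say, 12 layers, that amounts to roughly 2.4 million trainable parameters, a small fraction of the hundred-million-plus parameters in a BERT-base-sized model.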
Basic structure of an Adapter module, showing the down-projection, non-linearity, up-projection, and residual connection.
Adapters are typically inserted twice in each Transformer block: once after the Multi-Head Attention (MHA) sub-layer and once after the Feed-Forward Network (FFN) sub-layer, in each case before that sub-layer's final residual addition and layer normalization.
Placement of Adapter modules within a standard Transformer block, typically after the MHA and FFN sub-layers. Note the residual connections around the adapters themselves.
Here's a simplified PyTorch implementation of an Adapter module:
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, d_model, bottleneck_dim, dropout=0.1):
        super().__init__()
        self.down_project = nn.Linear(d_model, bottleneck_dim)
        self.non_linear = nn.GELU()  # Common choice, could be ReLU etc.
        self.up_project = nn.Linear(bottleneck_dim, d_model)
        self.dropout = nn.Dropout(dropout)

        # Initialize up_project weights and bias to zero.
        # This makes the adapter initially behave like an identity function.
        nn.init.zeros_(self.up_project.weight)
        nn.init.zeros_(self.up_project.bias)

    def forward(self, x):
        # x is the input from the previous layer (e.g., MHA or FFN output)
        adapter_input = x
        x = self.down_project(x)
        x = self.non_linear(x)
        x = self.up_project(x)
        x = self.dropout(x)
        # Add the residual connection
        output = adapter_input + x
        return output
# Example usage within a hypothetical Transformer layer forward pass,
# assuming `self.mha_adapter` and `self.ffn_adapter` are Adapter instances.

# hidden_states = ...output from MHA...
# adapted_mha_output = self.mha_adapter(hidden_states)
# hidden_states = layer_norm(adapted_mha_output + residual_mha_input)  # main residual & LayerNorm

# feed_forward_output = ...output from FFN...
# adapted_ffn_output = self.ffn_adapter(feed_forward_output)
# hidden_states = layer_norm(adapted_ffn_output + residual_ffn_input)  # main residual & LayerNorm
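To make that placement concrete, the following is a minimal sketch of a post-LayerNorm encoder block with the two adapters wired in. The block layout, dimensions, and module names (mha, ffn, norm1, norm2) are assumptions for illustration rather than any particular library's implementation:

class TransformerBlockWithAdapters(nn.Module):
    """Illustrative post-LN encoder block with two Adapter modules (assumed layout)."""

    def __init__(self, d_model=768, n_heads=12, ffn_dim=3072, bottleneck_dim=64):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mha_adapter = Adapter(d_model, bottleneck_dim)
        self.ffn_adapter = Adapter(d_model, bottleneck_dim)

    def forward(self, x):
        # MHA sub-layer: the adapter runs on the sub-layer output,
        # before the main residual addition and LayerNorm.
        attn_out, _ = self.mha(x, x, x)
        x = self.norm1(x + self.mha_adapter(attn_out))

        # FFN sub-layer: same pattern.
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.ffn_adapter(ffn_out))
        return x

During fine-tuning, only the two adapter modules (plus any task-specific head) would be trainable; the attention, feed-forward, and normalization weights stay frozen.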
Notice the initialization strategy for the up_project layer. Initializing its weights and biases to zero ensures that at the beginning of fine-tuning, the adapter module essentially acts as an identity function (output = input + 0), preserving the original model's behavior. This helps stabilize the start of the fine-tuning process.
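A quick check with the Adapter class above confirms this identity behavior at initialization (the input shape below is arbitrary):

adapter = Adapter(d_model=768, bottleneck_dim=64)
adapter.eval()  # disable dropout so the check is deterministic

x = torch.randn(2, 16, 768)  # (batch, sequence, hidden) dummy input
with torch.no_grad():
    out = adapter(x)

print(torch.allclose(out, x))  # True: zero-initialized up_project => output equals input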
The fine-tuning procedure using adapters involves these steps:

1. Freeze the pre-trained model. Set requires_grad=False for all parameters of the original Transformer model. This prevents them from being updated during training.
2. Insert the adapter modules (and any task-specific head). Their parameters have requires_grad=True and receive gradient updates.
3. Train on the downstream task as usual, optimizing only these trainable parameters.

Because only a small fraction of the total parameters are trained (typically 0.5% to 5%), the memory requirements for optimizer states and gradients are drastically reduced, and the training process is much faster compared to updating the entire model.
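In PyTorch, this setup might look like the following sketch. Here model stands for any hypothetical Transformer whose adapter submodules contain "adapter" in their parameter names; both the model object and that naming convention are assumptions for illustration:

# Freeze every parameter of the pre-trained model.
for param in model.parameters():
    param.requires_grad = False

# Un-freeze only the adapter parameters (and a task head, if present).
trainable = []
for name, param in model.named_parameters():
    if "adapter" in name or "classifier" in name:
        param.requires_grad = True
        trainable.append(param)

# The optimizer only ever sees the small set of trainable parameters.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

total = sum(p.numel() for p in model.parameters())
print(f"Training {sum(p.numel() for p in trainable):,} of {total:,} parameters")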
Using adapters offers several advantages:

- Parameter efficiency: only a small fraction of parameters is trained and stored per task, which sharply reduces memory for gradients and optimizer states.
- Modularity: the frozen base model can be shared across many tasks, with a lightweight adapter checkpoint swapped in for each one.
- Stability: because the original weights are never modified, the pre-trained knowledge of the base model is preserved.
- Competitive performance: adapters often approach the quality of full fine-tuning despite training far fewer parameters.
However, there are considerations:

- Inference latency: adapters add extra sequential computation to every block, and because of the non-linearity their weights cannot simply be merged into the frozen layers, so inference is slightly slower than with the original model.
- Hyperparameter choices: the bottleneck dimension and insertion points must be tuned, trading capacity against parameter count.
- Architecture changes: inserting modules requires modifying the model definition, which can be less convenient than methods that only attach weight deltas.
In summary, adapter modules provide a practical and effective approach for adapting large pre-trained language models to various downstream tasks without the prohibitive costs associated with full fine-tuning. They represent a foundational technique in the growing field of parameter-efficient fine-tuning.