Adapter modules offer a direct and intuitive approach to Parameter-Efficient Fine-Tuning (PEFT). Instead of modifying the existing weights of a large pre-trained language model (LLM), adapters introduce a small number of new, trainable parameters within the architecture while keeping the original LLM weights frozen. This strategy significantly reduces the number of parameters that need to be updated and stored for each downstream task, addressing the computational and storage challenges highlighted earlier.
The core idea revolves around injecting small neural network modules, the adapters, into the layers of the pre-trained transformer. These adapters are typically designed with a bottleneck architecture to maintain parameter efficiency.
A standard adapter module consists of two projection layers sandwiching a non-linearity. It takes the output h from a transformer sub-layer (like multi-head attention or the feed-forward network) as input.
Mathematically, the transformation applied by an adapter layer can be expressed as:
$$h' = h + W_{\text{up}}\,\sigma(h\,W_{\text{down}})$$

Here W_down is a d×m matrix that projects the d-dimensional sub-layer output h down to the bottleneck dimension m, σ is a non-linearity such as ReLU or GELU, and W_up is an m×d matrix that projects back up to d dimensions; the residual connection adds the result to the original h.

During fine-tuning, only the adapter parameters (W_down, W_up, and associated biases) are trained, while the original LLM parameters remain fixed. The bottleneck dimension m is a critical hyperparameter: a smaller m means fewer trainable parameters but may limit the adapter's capacity to capture task-specific information, while a larger m increases capacity at the cost of parameter efficiency. Typical values of m are orders of magnitude smaller than d. For instance, if d = 4096, m might be chosen in the range of 64 to 256.
Diagram illustrating the typical bottleneck architecture of an adapter module with a residual connection.
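As a concrete illustration, the following PyTorch sketch implements the bottleneck adapter described above. The class name, the GELU non-linearity, and the near-zero initialization of the up-projection are illustrative choices (the initialization follows the common practice of starting adapters close to an identity function) rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, apply a non-linearity, up-project, add residual."""

    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # W_down: d -> m
        self.act = nn.GELU()                        # sigma
        self.up = nn.Linear(bottleneck, d_model)    # W_up: m -> d
        # Zero-init the up-projection so the adapter starts out as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + W_up(sigma(h W_down)); the residual keeps the original signal intact.
        return h + self.up(self.act(self.down(h)))
```

With d = 4096 and m = 64, the two projections add 2dm + m + d ≈ 0.53M trainable parameters per adapter, orders of magnitude fewer than the frozen weights of the surrounding transformer block.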
Where adapters are inserted within the transformer architecture significantly influences their effectiveness. Early proposals explored various placements, and two patterns have become established: the Houlsby configuration, which inserts an adapter after both the multi-head attention and the feed-forward sub-layers of each block, and the Pfeiffer configuration, which inserts a single adapter after the feed-forward sub-layer only.
The choice of placement impacts the flow of information and how task-specific adaptations interact with the pre-trained representations. Placing adapters after both the attention and feed-forward sub-layers allows modification of the outputs from both core computational units of the transformer block, whereas the single-adapter placement adds roughly half as many parameters per block and incurs less inference overhead.
Simplified view comparing potential adapter placements within a transformer block (Houlsby vs. Pfeiffer). Input/Output represent connections to previous/next blocks.
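To make the two placements concrete, the sketch below wires the BottleneckAdapter from the earlier example into a simplified post-norm transformer block and selects the Houlsby-style or Pfeiffer-style pattern with a flag. The block layout and argument names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class AdaptedTransformerBlock(nn.Module):
    """Simplified transformer block with adapters after the attention and/or FFN sub-layers.

    style="houlsby": adapters after both sub-layers.
    style="pfeiffer": a single adapter after the FFN sub-layer only.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                 bottleneck: int = 64, style: str = "houlsby"):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn_adapter = (
            BottleneckAdapter(d_model, bottleneck) if style == "houlsby" else nn.Identity()
        )
        self.ffn_adapter = BottleneckAdapter(d_model, bottleneck)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.attn_adapter(attn_out))  # adapter before the residual add + norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.ffn_adapter(ffn_out))
        return x
```

With style="pfeiffer", the attention-side adapter is replaced by nn.Identity, so each block carries one adapter instead of two.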
Beyond the bottleneck dimension m and placement, other design choices influence adapter performance, including the non-linearity σ, the initialization of the adapter weights (typically chosen so that each adapter starts out close to an identity function, which stabilizes early training), and whether adapters are inserted into every transformer layer or only a subset of layers.
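As a quick check of that near-identity initialization, a freshly constructed adapter from the earlier sketch (whose up-projection is zero-initialized) passes its input through unchanged, so fine-tuning begins from the unmodified pre-trained representations:

```python
import torch

# Uses the BottleneckAdapter sketch from above; the tensor sizes are illustrative.
adapter = BottleneckAdapter(d_model=4096, bottleneck=64)
h = torch.randn(2, 16, 4096)          # (batch, sequence length, hidden size)
assert torch.allclose(adapter(h), h)  # zero-initialized up-projection => identity at the start
```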
Adapters offer a compelling balance between efficiency and effectiveness. They isolate task-specific knowledge into distinct modules, making it easy to switch between tasks by swapping adapters without affecting the base LLM. This modularity is a significant advantage in multi-task scenarios. However, the potential increase in inference latency and the need to tune placement and bottleneck size are important considerations. Compared to full fine-tuning, adapters dramatically reduce the adaptation cost while often achieving competitive performance on many NLP tasks.
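A minimal sketch of this workflow, reusing the illustrative classes above: the base weights are frozen by parameter name, only the adapter parameters remain trainable, and switching tasks amounts to saving and reloading a small adapter-only state dict (the file name is a placeholder).

```python
import torch
import torch.nn as nn

def mark_only_adapters_trainable(model: nn.Module) -> None:
    """Freeze every parameter whose name does not belong to an adapter module."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

# Illustrative sizes; the adapter parameter count is a tiny fraction of the block.
model = AdaptedTransformerBlock(d_model=4096, n_heads=32, d_ff=16384, bottleneck=64)
mark_only_adapters_trainable(model)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# Task switching: persist and restore only the adapter weights.
adapter_state = {k: v for k, v in model.state_dict().items() if "adapter" in k}
torch.save(adapter_state, "task_a_adapters.pt")  # hypothetical file name
# ...later, to return to this task, reload just the adapters into the same frozen base:
model.load_state_dict(torch.load("task_a_adapters.pt"), strict=False)
```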