As discussed in the chapter introduction, adapting billion-parameter foundation models presents substantial computational challenges, especially in few-shot scenarios where data is scarce. Full fine-tuning is often infeasible due to memory constraints and the risk of catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods offer a solution by modifying only a small fraction of the model's parameters. Among the earliest and most influential PEFT techniques are Adapter Modules.
The core idea behind adapters is elegantly simple: instead of modifying the existing weights of the large pre-trained model, insert small, trainable neural network modules between the layers of the frozen foundation model. During adaptation, only the parameters of these newly added adapter modules are updated, while the original foundation model weights remain untouched.
An adapter module typically consists of a sequence of operations designed to project the input activation to a lower-dimensional space (the bottleneck), apply a non-linearity, and then project it back to the original dimension. A common architecture is a down-projection layer, a non-linear activation such as ReLU or GELU, and an up-projection layer, combined with a residual connection around the whole module.
Mathematically, the transformation applied by an adapter module to an intermediate activation $h \in \mathbb{R}^d$ can be expressed as:

$$\text{Adapter}(h) = W_{\text{up}}\,\sigma\!\left(W_{\text{down}}\,h\right)$$

where $W_{\text{down}} \in \mathbb{R}^{m \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times m}$, and $\sigma$ is the non-linear activation. The dimension $m$ is the adapter's bottleneck dimension, a critical hyperparameter typically chosen such that $m \ll d$. This bottleneck is the source of the parameter efficiency: each adapter adds only about $2md$ weights (plus biases), far fewer than the $d^2$ or more weights of the sublayer it follows. For example, with $d = 768$ and $m = 48$, an adapter contributes roughly 74K trainable parameters. The final output $h'$ of the layer incorporating the adapter is:
$$h' = h + \text{Adapter}(h)$$

The residual connection is important: it lets the adapter learn a small modification to the layer's output rather than replace the layer's function outright. If the adapter learns to output zero (which can be encouraged through initialization), the adapted layer behaves identically to the original pre-trained layer.
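The following PyTorch sketch shows one way to implement this bottleneck design. The class name `Adapter`, the choice of GELU as the non-linearity $\sigma$, and the inclusion of bias terms are illustrative assumptions, not details fixed by the text above.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection, residual add."""

    def __init__(self, d_model: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)  # W_down: R^d -> R^m
        self.act = nn.GELU()                            # sigma
        self.up = nn.Linear(bottleneck_dim, d_model)    # W_up: R^m -> R^d

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + W_up(sigma(W_down(h)))
        return h + self.up(self.act(self.down(h)))
```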
Adapters can be inserted at various locations within a Transformer block. Common strategies include placing them sequentially after the multi-head self-attention (MHSA) sublayer and the feed-forward network (FFN) sublayer.
Figure: Insertion points for adapter modules within a standard Transformer block. Adapters are typically placed after the main sublayers (MHSA, FFN) and integrated via residual connections; the original Transformer block parameters remain frozen.
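As a rough illustration of this insertion pattern, the sketch below wraps a frozen Transformer block with two adapters, one after the MHSA sublayer and one after the FFN sublayer. It reuses the `Adapter` class from the previous sketch and assumes the wrapped block exposes `mhsa`, `ffn`, `norm1`, and `norm2` submodules in a pre-norm layout, with `mhsa` taking a single tensor; real model implementations name and arrange these components differently.

```python
import torch
import torch.nn as nn

class AdaptedTransformerBlock(nn.Module):
    """Wraps a frozen Transformer block, inserting adapters after its MHSA and FFN sublayers."""

    def __init__(self, frozen_block: nn.Module, d_model: int, bottleneck_dim: int):
        super().__init__()
        self.block = frozen_block                             # theta_FM (kept frozen)
        self.adapter_attn = Adapter(d_model, bottleneck_dim)  # trainable
        self.adapter_ffn = Adapter(d_model, bottleneck_dim)   # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # MHSA sublayer (frozen), followed by an adapter, then the block's residual connection.
        attn_out = self.block.mhsa(self.block.norm1(x))
        x = x + self.adapter_attn(attn_out)  # Adapter.forward adds its own internal residual

        # FFN sublayer (frozen), followed by an adapter, then the block's residual connection.
        ffn_out = self.block.ffn(self.block.norm2(x))
        x = x + self.adapter_ffn(ffn_out)
        return x
```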
During adaptation for a new task, the parameters of the foundation model ($\theta_{\text{FM}}$) are frozen. Only the parameters of the inserted adapter modules ($\theta_{\text{adapter}}$, specifically $W_{\text{down}}$ and $W_{\text{up}}$ in all adapters) are trainable. The training process involves minimizing a task-specific loss function $\mathcal{L}_{\text{task}}$ (e.g., cross-entropy for classification) with respect to $\theta_{\text{adapter}}$:

$$\min_{\theta_{\text{adapter}}} \; \mathcal{L}_{\text{task}}\big(\text{Model}(X;\, \theta_{\text{FM}},\, \theta_{\text{adapter}}),\, Y\big)$$

where $(X, Y)$ represents the few-shot training data for the target task.
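A minimal training sketch under these definitions might look as follows. It assumes that adapter modules are registered under parameter names containing the substring "adapter" and that `loader` yields few-shot batches $(X, Y)$; the optimizer choice and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def freeze_all_but_adapters(model: nn.Module) -> list:
    """Freeze theta_FM and return the trainable adapter parameters theta_adapter."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name  # assumption: adapters are named accordingly
        if param.requires_grad:
            trainable.append(param)
    return trainable

def adapt_few_shot(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-4) -> nn.Module:
    """Minimize L_task (cross-entropy here) with respect to theta_adapter only."""
    params = freeze_all_but_adapters(model)
    optimizer = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for X, Y in loader:      # few-shot inputs and labels
            loss = loss_fn(model(X), Y)
            optimizer.zero_grad()
            loss.backward()      # gradients flow through the frozen weights,
            optimizer.step()     # but only adapter parameters are updated
    return model
```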
A common initialization strategy is to initialize $W_{\text{down}}$ randomly (e.g., using Xavier or He initialization) and initialize $W_{\text{up}}$ to near-zero values. This ensures that at the beginning of training the adapter modules have minimal impact ($\text{Adapter}(h) \approx 0$), and the model behaves like the original pre-trained foundation model, providing a stable starting point for optimization.
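A small sketch of this initialization, reusing the hypothetical `Adapter` class from above. Here $W_{\text{up}}$ is set exactly to zero, one common realization of "near-zero", so the adapted layer initially reproduces its input through the residual path.

```python
import torch
import torch.nn as nn

def init_adapter_near_identity(adapter: Adapter) -> None:
    """Xavier-initialize W_down; zero-initialize W_up so that Adapter(h) starts at zero."""
    nn.init.xavier_uniform_(adapter.down.weight)
    nn.init.zeros_(adapter.down.bias)
    nn.init.zeros_(adapter.up.weight)
    nn.init.zeros_(adapter.up.bias)

# With W_up = 0, the adapter contributes nothing and the layer output equals its input.
adapter = Adapter(d_model=768, bottleneck_dim=48)
init_adapter_near_identity(adapter)
h = torch.randn(2, 16, 768)
assert torch.allclose(adapter(h), h)
```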
Adapters offer several compelling advantages for few-shot adaptation: only a small fraction of parameters is trained and stored per task, which keeps memory and storage requirements low; the frozen foundation model weights are never modified, which mitigates catastrophic forgetting and reduces the risk of overfitting on small datasets; and adapters are modular, so a single shared foundation model can serve many tasks by swapping in the appropriate task-specific adapters.
However, there are also considerations: the inserted modules are executed sequentially within every block and therefore add some inference latency; the bottleneck dimension $m$ and the insertion points must be tuned for the task; and although few parameters are updated, training still backpropagates through the entire frozen model, so activation memory savings are smaller than the parameter savings might suggest.
Adapters represent a foundational technique in PEFT. They demonstrate the viability of adapting large models by introducing and training only a small number of new parameters, paving the way for other methods like LoRA and prompt tuning, which explore different ways to achieve parameter efficiency, as we will see in the following sections.