Adapting a pre-trained Mixture of Experts model to a new task presents a different set of challenges compared to fine-tuning a dense model. While pre-training has already endowed the experts with specialized knowledge, the goal of fine-tuning is to steer this knowledge toward a new domain or capability without incurring the full cost of training from scratch or causing catastrophic forgetting. The immense scale of MoE models makes a naive full fine-tuning approach impractical for most applications. Therefore, a collection of more surgical strategies is required.
These strategies balance three competing factors: computational cost, memory footprint, and final task performance. Your choice of strategy will depend on your available resources and performance requirements.
The most direct method is to update all model parameters, including the gating network and every single expert, using the new task's dataset. This is analogous to standard fine-tuning for dense models.
Full fine-tuning is generally reserved for situations where the fine-tuning dataset is very large and of high quality, and maximum performance is the primary objective, regardless of cost.
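As a minimal sketch in PyTorch, full fine-tuning amounts to leaving every parameter trainable and handing them all to the optimizer. The learning rate and weight decay below are illustrative assumptions, not values from any specific recipe.

```python
import torch

def full_finetune_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Nothing is frozen: experts, gating networks, and shared layers all update.
    for param in model.parameters():
        param.requires_grad = True
    # A small learning rate is typical when adapting a large pre-trained model.
    return torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```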
A more resource-efficient and common set of techniques involves freezing large portions of the model and updating only specific components. This isolates the changes, preserves general knowledge, and dramatically reduces the computational load.
The diagram below illustrates three different update strategies for a single MoE layer.
Comparison of fine-tuning strategies. Green components indicate parameters being updated, while gray components are frozen.
Router-only tuning is one of the most efficient methods: all expert networks are frozen, and only the parameters of the gating network (the router) are updated.
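A minimal sketch of router-only tuning is shown below. It assumes gating modules can be identified by the substring "gate" in their parameter names; this naming is an assumption and varies between implementations, so check your model's module names before using it.

```python
import torch

def router_only_optimizer(model: torch.nn.Module,
                          router_key: str = "gate") -> torch.optim.Optimizer:
    # Freeze everything, then re-enable gradients only for router parameters.
    for name, param in model.named_parameters():
        param.requires_grad = router_key in name

    trainable = [p for p in model.parameters() if p.requires_grad]
    # Only a tiny fraction of the model's parameters end up in the optimizer.
    return torch.optim.AdamW(trainable, lr=1e-4)
```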
A logical next step is to unfreeze a small number of experts in addition to the router, which balances efficiency and adaptability. Two common ways to choose which experts to unfreeze are to select those the router activates most frequently on a sample of the new task's data, or to simply unfreeze a small, fixed number of experts per MoE layer.
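A sketch of the frequency-based variant follows. It assumes each MoE layer exposes its experts as an `nn.ModuleList` named `experts` and its router as `gate` (both names are assumptions), and that you have already counted how many tokens the router sends to each expert on a sample of the new data.

```python
import torch

def unfreeze_top_experts(moe_layer, expert_counts: torch.Tensor, k: int = 2) -> None:
    # expert_counts[i] = number of tokens routed to expert i on the new task's data.
    top_experts = set(torch.topk(expert_counts, k).indices.tolist())

    for idx, expert in enumerate(moe_layer.experts):
        for param in expert.parameters():
            # Only the k most frequently used experts remain trainable.
            param.requires_grad = idx in top_experts

    # Keep the router trainable so it can adapt its routing to the new task.
    for param in moe_layer.gate.parameters():
        param.requires_grad = True
```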
Techniques like Low-Rank Adaptation (LoRA) introduce small, trainable "adapter" matrices into the model while keeping the massive pre-trained weights frozen. This principle extends naturally to MoE models.
Instead of updating the original weight matrix $W$, LoRA approximates the update with a low-rank decomposition, $\Delta W = BA$, where $B$ and $A$ are the small, trainable matrices.
For an MoE model, you can apply LoRA in several places: to the shared attention layers, to the gating network (the router), or to the experts' feed-forward weights themselves.
A combination of these, for instance applying LoRA to the attention layers and the router, is often a very effective and efficient starting point.
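A minimal LoRA wrapper is sketched below: it freezes a pre-trained `nn.Linear` and adds the trainable low-rank update $BA$. Wrapping the router's projection (or an attention projection) with it is one way to realize the combination mentioned above; the `rank` and `alpha` values are illustrative defaults, and the attribute name in the usage comment is hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # pre-trained weight stays frozen
        # B is zero-initialized so the wrapped layer starts out identical to the original.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x plus the low-rank correction (B A) x; only A and B receive gradients.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example usage (hypothetical attribute name for the router's projection):
# moe_layer.gate = LoRALinear(moe_layer.gate, rank=8)
```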
A critical consideration for MoE models is the fate of the load-balancing auxiliary loss, $L_{\text{aux}}$, during fine-tuning. Discarding it is a mistake. Without the pressure to distribute tokens, the router can quickly suffer "expert collapse," learning to send all tokens to a single expert that becomes highly over-specialized for the fine-tuning task. This destroys the benefits of the MoE architecture.
Therefore, the auxiliary loss should be kept active during fine-tuning. However, the fine-tuning dataset is typically much narrower than the pre-training corpus, so forcing a perfectly uniform distribution across experts can be counterproductive. A common practice is to reduce the weight of the auxiliary loss by decreasing its coefficient $\alpha$.
A typical value for $\alpha$ during fine-tuning is an order of magnitude smaller than the one used during pre-training. This maintains enough pressure to prevent expert collapse while giving the router enough flexibility to prefer the experts that are genuinely more relevant to the downstream task. Always monitor expert utilization during fine-tuning to ensure a healthy distribution.
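For illustration, a sketch of how the combined fine-tuning objective might be formed is shown below. The coefficient values are assumptions chosen to reflect the order-of-magnitude reduction described above; use whatever coefficient your model was pre-trained with as the reference point.

```python
import torch

def fine_tuning_loss(task_loss: torch.Tensor,
                     aux_loss: torch.Tensor,
                     aux_coeff: float = 1e-3) -> torch.Tensor:
    # If pre-training used a coefficient around 1e-2, dropping to 1e-3 keeps some
    # load-balancing pressure while letting the router favor task-relevant experts.
    return task_loss + aux_coeff * aux_loss
```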