Implementing a Mixture of Experts (MoE) layer is only part of the problem. Training it effectively, especially in models containing hundreds or thousands of experts, presents a distinct set of engineering challenges. Standard training procedures are often insufficient for these sparse architectures, which demand specialized techniques to manage immense parameter counts and maintain stability.
This chapter shifts from architectural theory to the practical mechanics of training and optimization. You will learn the methods required to train large-scale MoE models from scratch and to fine-tune existing ones.
The chapter concludes with a hands-on exercise where you will configure a distributed training job, applying these techniques to a large-scale MoE model.
3.1 Expert Parallelism for Distributed Training
3.2 Combining Model, Data, and Expert Parallelism
3.3 Capacity Factor and its Impact on Performance
3.4 Techniques for Mitigating Router Z-Loss Instability
3.5 Precision and its Effects: BFloat16 Training
3.6 Fine-tuning Strategies for Pre-trained MoE Models
3.7 Practice: Configuring a Distributed Training Job