The auxiliary load balancing loss helps prevent expert collapse, yet one of its components can, ironically, become a source of significant training instability. This component is often called the router z-loss. Understanding its origin and how to manage it is a non-negotiable skill for training large-scale MoE models successfully.
The instability originates from the raw, unnormalized logits produced by the gating network. Recall from Chapter 1 that the auxiliary loss includes a term designed to encourage the router to use a diverse set of experts. This term is often calculated based on the sum of the squares of the gating network's logits.
Let $\mathcal{L}_{aux}$ be the auxiliary loss, and let $z_i(x)$ be the logit from the gating network for expert $i$ given an input token $x$. The z-loss component, $\mathcal{L}_z$, is proportional to the sum of the squares of these logits, averaged over all tokens in a batch. A simplified representation is:

$$\mathcal{L}_z = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} \sum_{i=1}^{N} z_i(x)^2$$

where $\mathcal{B}$ is the batch of tokens and $N$ is the number of experts.
The purpose of this loss is to keep the magnitude of the logits small, which indirectly encourages the softmax distribution over experts to be less sharp, preventing the router from becoming overly confident and routing all tokens to just a few experts early in training.
The problem arises when the logits grow very large. Because this loss term is quadratic, even a moderate increase in logit values can cause $\mathcal{L}_z$ to explode. If this happens, the z-loss can overwhelm the primary task loss (e.g., cross-entropy), sending massive, unhelpful gradients back through the gating network. This can destabilize the entire training process, causing the total loss to spike and model performance to collapse.
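To make the quadratic behavior concrete, here is a minimal sketch of the simplified z-loss above in PyTorch; the function name compute_router_z_loss and the tensor shapes are illustrative, not a fixed API:

import torch

def compute_router_z_loss(logits: torch.Tensor) -> torch.Tensor:
    # logits: (num_tokens, num_experts), raw outputs of the gating network
    # Sum the squared logits per token, then average over the batch
    return logits.pow(2).sum(dim=-1).mean()

logits = torch.randn(8, 4)                   # 8 tokens routed over 4 experts
print(compute_router_z_loss(logits))         # small when logits are near zero
print(compute_router_z_loss(10 * logits))    # 100x larger: quadratic growth

Scaling the logits by a factor of 10 scales the loss by a factor of 100, which is exactly why a moderate drift in logit magnitude can suddenly dominate the total loss.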
The most direct and widely used method for controlling the z-loss is to scale it with a small coefficient. This hyperparameter, often called router_z_loss_coef or a similar name, multiplies the z-loss before it is added to the total loss.
The total loss for the model becomes:

$$\mathcal{L}_{total} = \mathcal{L}_{task} + \mathcal{L}_{balance} + \lambda_z \, \mathcal{L}_z$$

where $\mathcal{L}_{task}$ is the primary task loss and $\mathcal{L}_{balance}$ is the load balancing term of the auxiliary loss. Here, $\lambda_z$ is the router_z_loss_coef. By setting $\lambda_z$ to a small value, typically in the range of 0.001 to 0.01, you reduce the influence of the z-loss on the total gradient.
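As a sketch of how this fits into a training step, the snippet below combines placeholder loss values; the names task_loss, balance_loss, and z_loss stand in for quantities your forward pass already computes:

import torch

router_z_loss_coef = 1e-3   # lambda_z, the z-loss scaling coefficient

# Placeholder tensors standing in for losses computed in the forward pass
task_loss = torch.tensor(2.31)      # e.g. cross-entropy on the next token
balance_loss = torch.tensor(0.05)   # load balancing term of the auxiliary loss
z_loss = torch.tensor(12.0)         # unscaled router z-loss

total_loss = task_loss + balance_loss + router_z_loss_coef * z_loss
print(total_loss)   # the z-loss contributes only 0.012 after scaling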
The choice of this coefficient involves a trade-off: if it is too large, the z-loss dominates the gradients flowing into the gating network and over-regularizes the router, keeping the logits so small that routing decisions stay close to uniform; if it is too small, it fails to restrain the logits and offers little protection against the instability described above.
In practice, starting with a value like 1e-3 is a common heuristic. Monitoring your training logs for sudden spikes in the total loss that correspond to spikes in the auxiliary loss is the primary way to diagnose if this value needs adjustment. The chart below illustrates a typical instability event where the router z-loss explodes.
At step 60, the router z-loss spikes, causing a corresponding jump in the total loss. The primary task loss remains stable initially but would degrade if training continued in this unstable state. This is a clear signal to decrease the router_z_loss_coef.
Beyond scaling the loss, you can employ other strategies, often in combination, to further improve stability.
The initial state of the gating network can predispose the model to instability. If the weights of the final linear layer in the gating network are initialized too large, the initial logits can be large enough to cause an immediate z-loss spike on the very first training step.
A simple and effective technique is to initialize the weights of this final layer to a very small value, or even to zero. For instance, using a truncated normal distribution with a very small standard deviation (e.g., 0.001) or a direct zero-initialization for the final weight matrix ensures that the initial logits are close to zero. This leads to a near-uniform distribution over experts at the start of training, allowing the router to learn its preferences gradually without causing an initial loss explosion.
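A minimal sketch of this initialization, assuming the gating network ends in a plain nn.Linear projection from the hidden dimension to one logit per expert (the dimensions here are placeholders):

import torch.nn as nn

hidden_dim, num_experts = 512, 8

# Final projection of the gating network: hidden state -> one logit per expert
gate = nn.Linear(hidden_dim, num_experts, bias=False)

# Option 1: truncated normal with a very small standard deviation
nn.init.trunc_normal_(gate.weight, std=0.001)

# Option 2: zero-initialize so every expert starts with an identical logit
# nn.init.zeros_(gate.weight)

Either choice keeps the initial logits near zero, so the softmax over experts starts close to uniform.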
Another direct approach is to cap the magnitude of the logits before they are used to compute the z-loss. This acts as a hard preventative measure against runaway values. You can implement this by clamping the logit tensor to a predefined range.
For example, in PyTorch:
import torch

# Inside your MoE layer, after computing the raw router logits
LOGIT_CAP = 30.0

# Clamp the logits for the z-loss calculation ONLY.
# The original, unclamped logits should still be used for the softmax and routing.
clamped_logits = torch.clamp(logits, -LOGIT_CAP, LOGIT_CAP)

# Compute the z-loss from the clamped values (mean of summed squared logits)
z_loss = clamped_logits.pow(2).sum(dim=-1).mean()
This ensures that no matter how large the network's weights become, the contribution to the z-loss from any single logit is bounded. The choice of the clipping value is another hyperparameter, but it's generally less sensitive than the loss coefficient. A value between 20 and 50 is often sufficient to prevent the most extreme numerical issues. The main drawback is that it can "saturate" the router's decision-making process if the cap is too low, but its primary role here is as a safety net for stability.
By combining a sensible z-loss coefficient, careful initialization, and potentially logit clipping, you can effectively tame the router's behavior and create the stable conditions necessary for training even the largest Mixture of Experts models.