Building effective Mixture of Experts models requires more than just implementing routing mechanisms; it also necessitates diagnosing and understanding their behavior. Analysis of how the gating network routes tokens and how experts specialize is a significant part of the development process. This analysis helps verify that the model is learning efficiently, that all experts are contributing, and that the chosen routing strategy achieves its desired effect.
Without this step, you are flying blind. A model might appear to train, but it could be suffering from issues like expert collapse, where most tokens are routed to a small handful of experts, leaving the majority of the model's parameters unused and undertrained.
The first step in any analysis is to look at the aggregate statistics of router assignments. The most fundamental metric is expert utilization, which measures how many tokens each expert processes over a given dataset, such as a validation set. A healthy MoE model should exhibit relatively balanced utilization, ensuring that all experts have an opportunity to learn.
You can calculate this by passing a large number of tokens through the model and counting the assignments for each expert. A simple histogram is often the best way to visualize this.
Distribution of tokens across eight experts for a balanced versus an imbalanced router. The imbalanced case shows classic expert collapse, with Experts 0 and 2 receiving the majority of tokens.
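The snippet below is a minimal sketch of this counting procedure. It assumes top-1 routing and uses random logits as a stand-in for the router outputs you would collect (for example, via forward hooks on each MoE layer) during a validation pass:

```python
import torch

# Stand-in router outputs: in practice, collect these from forward hooks
# on each MoE layer while running a validation set through the model.
num_experts = 8
num_tokens = 10_000
router_logits = torch.randn(num_tokens, num_experts)  # placeholder logits

# Top-1 routing: each token goes to its highest-scoring expert.
assignments = router_logits.argmax(dim=-1)

# Count tokens per expert and normalize to fractions.
counts = torch.bincount(assignments, minlength=num_experts)
fractions = counts.float() / num_tokens

# A quick text histogram; a perfectly balanced 8-expert router gives ~0.125 each.
for expert_id, frac in enumerate(fractions.tolist()):
    print(f"Expert {expert_id}: {frac:.3f} " + "#" * int(frac * 160))
```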
In addition to direct counting, you should monitor the auxiliary losses discussed in Chapter 1. Two components are particularly informative: the load-balancing loss, which is minimized when tokens are spread evenly across experts, and the router z-loss, which penalizes large gating logits and flags numerical instability in the router.
Monitoring these values during training provides a real-time diagnostic dashboard for the health of your routing system.
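For reference, here is a sketch of two commonly used formulations: the Switch-Transformer-style load-balancing loss and the router z-loss. Your implementation from Chapter 1 may differ in details such as scaling coefficients, so treat this as an illustration rather than a drop-in replacement:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Switch-style balance loss: num_experts * sum(f_i * p_i), where f_i is
    the fraction of tokens dispatched to expert i and p_i is the mean router
    probability for expert i. It bottoms out at 1.0 when both are uniform."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / num_tokens
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalizes large gating logits; rising values suggest the router is
    drifting toward overconfident, numerically unstable decisions."""
    return torch.logsumexp(router_logits, dim=-1).square().mean()

logits = torch.randn(4096, 8)  # placeholder router outputs
print(load_balancing_loss(logits).item(), router_z_loss(logits).item())
```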
Once you've confirmed that the load is reasonably balanced, the next question is: what have the experts actually learned? In a well-trained MoE model, different experts develop specialized functions. One expert might become proficient at processing punctuation, another at handling verbs in a specific language, and a third at understanding syntax from a programming language.
Identifying this specialization requires qualitative analysis. The most direct method is to inspect the tokens that are routed to a specific expert.
To create a "profile" for an expert, you can run a large, diverse dataset through the model and collect all tokens assigned to that expert. By examining the most frequent or representative tokens, you can often infer the expert's function.
For example, after analyzing a model trained on a mixed-language and code dataset, you might find:
- One expert dominated by tokens like {, (, ), ;, ., and ,. This expert has likely specialized in punctuation and structural syntax.
- Another whose top tokens are def, import, for, in, and return. This expert has clearly become a Python code specialist.
- A third receiving the, is, a, of, and was. This expert handles common English stop words.

This process moves from a quantitative "how many" to a qualitative "what kind", giving you insight into the model's internal division of labor.
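A minimal sketch of automating this profiling, using a few hand-written tokens and assignments as placeholders for the data you would collect during an evaluation pass:

```python
from collections import Counter

# Placeholder data: decoded tokens from a corpus and the matching top-1
# expert index recorded for each token during an evaluation pass.
token_strings = ["def", "(", "x", ")", ":", "return", "the", "import", "of"]
assignments   = [1, 2, 0, 2, 2, 1, 3, 1, 3]

# Group token counts by the expert that processed them.
expert_profiles: dict[int, Counter] = {}
for token, expert_id in zip(token_strings, assignments):
    expert_profiles.setdefault(expert_id, Counter())[token] += 1

# The most frequent tokens per expert hint at its learned specialty.
for expert_id, counter in sorted(expert_profiles.items()):
    print(f"Expert {expert_id}: {counter.most_common(5)}")
```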
A diagram of a gating network routing different types of tokens to specialized experts. This illustrates the intended outcome of MoE training.
Modern deep learning models contain dozens of layers, and MoE blocks are often placed in many of them. This raises another question: does a token's routing decision in an early layer influence its path in later layers?
To investigate this, you can create a routing map for a given input sequence. This visualization tracks which expert is selected for each token at every MoE layer in the network. A heatmap is an excellent tool for this, with tokens on one axis, layers on another, and the cell color indicating the chosen expert ID.
A routing map for a short sequence of Python code. Note how tokens like def, for, and in (Tokens 0, 6, 8) are consistently routed to the same expert (Expert 7) in the early layers, suggesting it has specialized in Python keywords. Punctuation like ( and ) is handled by Expert 2.
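The following sketch builds such a heatmap from a (layers × tokens) array of expert choices; random integers stand in for the assignments you would record from the router at each MoE layer:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder routing decisions: expert_choices[l, t] is the expert selected
# for token t at MoE layer l. Record these from each layer's router.
num_layers, num_tokens, num_experts = 6, 12, 8
rng = np.random.default_rng(0)
expert_choices = rng.integers(0, num_experts, size=(num_layers, num_tokens))

# A qualitative colormap keeps the categorical expert IDs visually distinct.
fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(expert_choices, aspect="auto", cmap="tab10",
               vmin=0, vmax=num_experts - 1)
ax.set_xlabel("Token position")
ax.set_ylabel("MoE layer")
fig.colorbar(im, ax=ax, label="Expert ID")
plt.tight_layout()
plt.show()
```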
These maps can reveal fascinating patterns. For instance, you might observe that once a token is identified as part of a specific domain (like code or a foreign language), it tends to be sent to experts specializing in that domain throughout the network. This suggests that the model learns a hierarchical processing strategy, where early layers perform broad categorization and later layers refine the processing within that category.
This level of analysis is not just academic. It provides tangible evidence of whether your model is leveraging its capacity effectively. If you find that routing decisions are chaotic or that specialization is weak, it may point to problems with your training data, hyperparameters, or choice of routing algorithm, guiding you toward a better model. The hands-on section that follows will give you a chance to implement these analytical techniques on the routers you build.