The gating network is the control center of a Mixture of Experts (MoE) layer. Its sole responsibility is to inspect each incoming token and decide which of the available experts are best suited to process it. This routing decision is not static; it is a learned function that adapts during training, allowing the model to develop sophisticated, data-driven pathways for information. Unlike a simple switch, the gating network produces a set of continuous-valued scores that can be interpreted as its confidence in assigning a token to each expert.

## Architecture and Mathematical Formulation

At its core, the gating network is a simple feed-forward network, typically just a single linear layer applied to the token's input embedding, followed by a softmax function. This design keeps the routing mechanism computationally lightweight, which matters because it is executed for every token in every MoE layer.

Consider a single input token represented by a vector $x \in \mathbb{R}^d$, where $d$ is the model's hidden dimension. The gating network has a trainable weight matrix $W_g \in \mathbb{R}^{d \times N}$, where $N$ is the total number of experts in the layer.

The first step is to compute the logits, or raw scores, for each expert by projecting the input token onto the gating weights:

$$ H = x \cdot W_g $$

The result, $H$, is a vector of length $N$, where each element $H_i$ represents the raw score for assigning the token $x$ to expert $i$. To convert these scores into a probability distribution, we apply the softmax function:

$$ g(x)_i = \text{softmax}(H)_i = \frac{\exp(H_i)}{\sum_{j=1}^{N} \exp(H_j)} $$

The output vector, $g(x)$, contains the gating weights. Each component $g(x)_i$ is a value between 0 and 1, and the components sum to 1. This vector represents a soft assignment of the token across all experts.

```dot
digraph G {
  rankdir=TB;
  splines=ortho;
  node [shape=box, style="rounded,filled", fontname="Arial", margin="0.2,0.1"];
  edge [fontname="Arial", fontsize=10];

  subgraph cluster_input {
    label="Input Token";
    style=filled;
    color="#e9ecef";
    x [label="Token Embedding (x)", shape=box, style="rounded,filled", fillcolor="#a5d8ff"];
  }

  subgraph cluster_gating {
    label="Gating Network";
    style=filled;
    color="#e9ecef";
    gating_logic [label="Linear Layer (W_g)\n+ Softmax", shape=box, style="rounded,filled", fillcolor="#d0bfff"];
  }

  subgraph cluster_experts {
    label="Expert Networks";
    style=filled;
    color="#e9ecef";
    node [shape=box, style="rounded,filled", fillcolor="#96f2d7"];
    E1 [label="Expert 1"];
    E2 [label="Expert 2"];
    E_dots [label="...", shape=none, fillcolor=none];
    EN [label="Expert N"];
  }

  output [label="Gating Weights g(x)", shape=box, style="rounded,filled", fillcolor="#ffd8a8"];

  x -> gating_logic;
  gating_logic -> output [label=" Scores per expert"];
  {rank=same; E1; E2; E_dots; EN;}
  output -> E1 [label=" g(x)₁", style=dashed, color="#495057"];
  output -> E2 [label=" g(x)₂", style=dashed, color="#495057"];
  output -> EN [label=" g(x)ₙ", style=dashed, color="#495057"];
}
```

The gating network processes an input token embedding to produce a vector of weights, one for each expert.
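As a concrete illustration, here is a minimal PyTorch sketch of this gating computation. The class name `GatingNetwork` and the dimensions in the usage example are illustrative choices, not part of any particular model.

```python
import torch
import torch.nn as nn


class GatingNetwork(nn.Module):
    """Minimal gating network: one linear projection followed by a softmax.

    Illustrative sketch; d_model and num_experts correspond to d and N
    in the formulas above.
    """

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # W_g in R^{d x N}: projects a token embedding onto one logit per expert.
        self.w_g = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) or (batch, seq_len, d_model)
        logits = self.w_g(x)                  # H = x · W_g, shape (..., N)
        return torch.softmax(logits, dim=-1)  # g(x): each row sums to 1


# Usage: route a batch of 4 tokens with hidden size 16 across 8 experts.
gate = GatingNetwork(d_model=16, num_experts=8)
tokens = torch.randn(4, 16)
weights = gate(tokens)          # shape (4, 8)
print(weights.sum(dim=-1))      # approximately tensor([1., 1., 1., 1.])
```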
## Enforcing Sparsity with Top-k Routing

While the softmax function produces a dense vector of probabilities, a primary goal of MoE is sparse computation: only a fraction of the experts should be activated for any given token. To enforce this, we apply a top-k selection mechanism. Instead of using all $N$ experts, the gating network selects the $k$ experts with the highest scores in the logits vector $H$. The value of $k$ is a critical hyperparameter, often set to 1 or 2.

For example, with $k=2$, the router identifies the two experts with the highest logits. All other experts are ignored for this specific token, ensuring that the computational cost is proportional to $k$ rather than to the total number of experts $N$.

This selection introduces a challenge: the top-k operation is non-differentiable, which complicates training via backpropagation. In practice, gradients are passed only through the connections to the chosen top-k experts. The gating weights for the selected experts are then used to scale their outputs, and a common approach is to re-normalize these $k$ weights so they sum to 1. The final output of the MoE layer for input $x$ is the weighted sum of the outputs from the selected experts (a code sketch tying these routing and weighting steps together appears at the end of this section):

$$ y(x) = \sum_{i \in \text{TopK}(H)} \frac{\exp(H_i)}{\sum_{j \in \text{TopK}(H)} \exp(H_j)} \cdot E_i(x) $$

## Training the Router

The gating network's weight matrix, $W_g$, is not fixed; it is trained jointly with the rest of the model. The overall training loss backpropagates through the selected $k$ experts and, importantly, back to the gating network itself.

This end-to-end training process teaches the router its function. If routing a certain type of token to a particular expert consistently reduces the model's loss, the gradients update $W_g$ to increase the probability of that assignment in the future. This feedback loop is what drives the experts to specialize. The gating network learns to identify features in the token embeddings that predict which expert will be most effective, effectively becoming a learned traffic controller that optimizes the flow of information through the model's specialized pathways.

However, this process is not without its own challenges. A naive training setup can lead to imbalanced routing, where the gating network favors a small number of experts and leaves the others undertrained. This issue is addressed by auxiliary losses, which we will examine in the "Load Balancing and Auxiliary Losses" section.
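To close out the section, here is a minimal sketch of a full sparse MoE forward pass that combines the gating projection, top-k selection, and re-normalized weighting from the $y(x)$ formula. It assumes each expert is a small two-layer feed-forward block and uses $k=2$; the class name `SparseMoELayer` and the per-expert loop are illustrative rather than a performance-oriented implementation.

```python
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    """Sketch of a top-k MoE layer: gating, top-k routing, weighted combination.

    Assumptions: each expert is a two-layer feed-forward block, and the
    selected gating weights are re-normalized with a softmax over the
    top-k logits, matching the y(x) formula above.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # gating weights W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.w_g(x)                                        # H, shape (T, N)
        topk_logits, topk_idx = torch.topk(logits, self.k, dim=-1)  # (T, k) each
        # Re-normalize over the selected experts only: softmax of the top-k logits.
        topk_weights = torch.softmax(topk_logits, dim=-1)           # (T, k)

        output = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                # Tokens whose slot-th choice is expert e.
                mask = topk_idx[:, slot] == e
                if mask.any():
                    output[mask] += topk_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return output


# Usage: 4 tokens, hidden size 16, 8 experts, top-2 routing.
moe = SparseMoELayer(d_model=16, d_ff=32, num_experts=8, k=2)
y = moe(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 16])
```

Because the top-k weights are computed from the gating logits, calling `.backward()` on the model's loss produces gradients for `w_g` alongside the expert parameters; the joint training of the router described above falls out of ordinary backpropagation.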