Routing mechanisms such as top-k and switch gating perform a "hard" assignment: a token is routed to a small, discrete set of experts, while all other experts are ignored for that token's computation. This hard selection is the source of the computational savings in sparse models, but it also introduces challenges such as non-differentiability and the need for auxiliary load-balancing losses.
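For contrast, here is a minimal PyTorch sketch of a switch-style (top-1) gate. The class name, shapes, and parameters are illustrative assumptions, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchGate(nn.Module):
    """Illustrative switch-style (top-1) router: a hard, discrete expert choice per token."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)     # (num_tokens, num_experts)
        weight, expert_idx = probs.max(dim=-1)      # hard choice: one expert per token
        # Only the chosen expert runs for each token; the rest are skipped,
        # which is where the computational savings of sparse MoE come from.
        return expert_idx, weight
```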
Soft MoE offers a different approach by replacing this discrete selection with a "soft," weighted combination of all experts. Instead of choosing which experts to use, the gating network determines a weight for every expert, and the final output is a weighted sum of the outputs from all experts. This makes the entire MoE layer fully differentiable and elegantly sidesteps the training instabilities associated with hard gating.
In a Soft MoE layer, the gating network operates similarly to a standard router by producing a logit for each expert. However, instead of using these logits to select the top-k experts, we apply a softmax function across them. This converts the logits into a set of positive weights that sum to one, effectively forming a probability distribution over the experts.
The final output for an input token $x$ is not the output of a few selected experts, but a linear combination of the outputs from all $N$ experts. The contribution of each expert $E_i(x)$ is scaled by its corresponding softmax weight $w_i$.

The mathematical formulation is direct. Given an input $x$, the gating network $G$ computes a vector of logits $h(x)$. The weights $w$ are then calculated as:
$$
w = \mathrm{softmax}(h(x))
$$

The final output $y$ of the Soft MoE layer is the weighted sum:

$$
y = \sum_{i=1}^{N} w_i \cdot E_i(x)
$$

This formulation might look familiar. It closely resembles the attention mechanism, where a query attends to a set of keys to produce weights, which are then used to compute a weighted sum of values. In Soft MoE, you can think of the token's representation as the query and the experts as the keys and values.
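This formulation translates almost line for line into code. The sketch below is a minimal PyTorch version, assuming simple two-layer MLP experts and a single linear gate (all names and shapes are illustrative): it computes the softmax weights and the weighted sum over all experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoELayer(nn.Module):
    """Illustrative Soft MoE layer: all experts run; outputs are blended by softmax weights."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # produces the logits h(x)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        w = F.softmax(self.gate(x), dim=-1)          # (num_tokens, N), rows sum to 1
        expert_outs = torch.stack(
            [expert(x) for expert in self.experts],  # every expert processes every token
            dim=1,                                   # (num_tokens, N, d_model)
        )
        # y = sum_i w_i * E_i(x): fully differentiable, but compute scales with N.
        return (w.unsqueeze(-1) * expert_outs).sum(dim=1)

# Example usage with arbitrary sizes
layer = SoftMoELayer(d_model=512, d_hidden=2048, num_experts=8)
y = layer(torch.randn(16, 512))   # 16 tokens in, 16 tokens out
```

The loop over `self.experts` makes the trade-off explicit: running every expert on every token is exactly what keeps the layer fully differentiable, and exactly what makes it expensive.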
The diagrams below illustrate the difference between the data flow in a hard-gating MoE and a Soft MoE.
In hard routing, the gating network selects a discrete expert (Expert 1), and all computation flows through it. Other experts remain inactive for this token.
In Soft MoE, the gating network computes a weight for every expert. The final output is a weighted combination of all expert outputs.
The primary advantage of Soft MoE is that it resolves the training challenges of sparse models: the layer is fully differentiable end to end, gradients flow to every expert on every step, and there is no need for auxiliary load-balancing losses to keep a discrete router healthy.

However, this elegance comes at a significant and often prohibitive cost: every expert must process every token, so the computation per token scales with the total number of experts N rather than with the small constant k of a sparse router. With 64 experts and a top-2 baseline, for instance, a Soft MoE layer performs roughly 32 times the expert computation per token.
Given its computational demands, a "pure" Soft MoE is rarely used in large-scale language models where computational efficiency is a primary design goal. Its formulation serves more as a theoretical benchmark and an analytical tool.
However, the core idea of soft, differentiable assignments has influenced the design of more practical, hybrid systems. For example, some approaches might use a top-k router to select a small subset of experts and then compute a soft, weighted combination within that subset. This can provide some of the training stability of soft routing while retaining most of the computational benefits of sparsity.
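A minimal sketch of that hybrid pattern, reusing the same illustrative MLP experts as in the earlier example, might look like this: the gate makes a hard top-k selection, and the combination within the selected subset is a soft, softmax-weighted sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSoftCombine(nn.Module):
    """Illustrative hybrid: hard top-k selection, soft weighting within the selected subset."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                  # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)    # hard selection of k experts
        w = F.softmax(topk_logits, dim=-1)                     # soft weights within the subset
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                            # expert chosen in this slot
            for e in idx.unique().tolist():
                mask = idx == e                                # tokens routed to expert e
                out[mask] += w[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Only the k selected experts run for each token, but within that subset the output is a differentiable softmax-weighted sum, mirroring the Soft MoE formulation on a reduced expert set.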
Understanding Soft MoE is important because it clearly delineates the trade-off between mathematical simplicity in training and the computational sparsity required for scaling. It represents one end of the spectrum in MoE design, where training stability is maximized at the expense of inference efficiency. This provides a valuable contrast to mechanisms like Switch Transformers, which occupy the other end of the spectrum by prioritizing computational efficiency above all else.