The gating network, or router, is the control center of a Mixture of Experts (MoE) layer. Its primary function is to examine each input token and determine which expert(s) should process it. As outlined in the chapter introduction, the design of this router is not merely an implementation detail; it profoundly influences the model's ability to specialize, its training stability, and its overall computational profile. Different architectural choices for the router offer distinct trade-offs between expressive power, computational cost, and ease of optimization. Here, we analyze three common classes of router architectures: linear, non-linear, and attention-based.
Linear Routers
The simplest form of router applies a single linear transformation to the token representation and then converts the resulting logits into an expert assignment, typically via a softmax (producing probabilities) or via direct top-k selection on the logits.
Given an input token representation $x \in \mathbb{R}^d$, where $d$ is the model dimension, and $N$ available experts, the router computes logits $h \in \mathbb{R}^N$ using a learned weight matrix $W_g \in \mathbb{R}^{d \times N}$:

$$h = x W_g$$
These logits $h$ directly inform the selection process. For instance, in a probabilistic routing scenario (less common in modern sparse MoEs but useful for illustration), gating probabilities $p \in \mathbb{R}^N$ could be computed via softmax:

$$p = \mathrm{softmax}(h)$$
More commonly, for top-$k$ routing (where $k$ experts are chosen, often $k=1$ or $k=2$), the router directly selects the experts corresponding to the $k$ largest values in $h$. Optionally, noise can be added to the logits before the top-$k$ selection, particularly during training, to encourage exploration and improve load balancing:

$$h_{\text{noisy}} = h + \epsilon \cdot \mathrm{softplus}(x W_{\text{noise}})$$

where $\epsilon$ is sampled from a standard normal distribution and $W_{\text{noise}}$ is another learned projection. The final expert selection uses $h$ or $h_{\text{noisy}}$.
Basic data flow for a linear router architecture.
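The noisy top-$k$ scheme above can be sketched in a few lines of NumPy. This is an illustrative single-token version, not a production implementation; the function and variable names (`noisy_topk_router`, `W_g`, `W_noise`) are chosen to mirror the notation above and are not from any particular library.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def noisy_topk_router(x, W_g, W_noise, k=2, train=True, rng=None):
    """Linear router with optional noisy top-k selection (illustrative sketch).

    x: (d,) token representation; W_g, W_noise: (d, N) learned projections.
    Returns the indices of the k selected experts and their softmax-normalized
    gate weights over those k experts.
    """
    h = x @ W_g                               # logits over N experts
    if train:
        rng = rng or np.random.default_rng(0)
        eps = rng.standard_normal(h.shape)    # eps ~ N(0, 1), per logit
        h = h + eps * softplus(x @ W_noise)   # learned, input-dependent noise scale
    topk = np.argsort(h)[-k:][::-1]           # indices of the k largest logits
    gates = np.exp(h[topk] - h[topk].max())   # softmax restricted to chosen experts
    gates /= gates.sum()
    return topk, gates

# Toy usage: model dim d=8, N=4 experts, select top-2.
rng = np.random.default_rng(42)
x = rng.standard_normal(8)
W_g = rng.standard_normal((8, 4))
W_noise = rng.standard_normal((8, 4))
experts, gates = noisy_topk_router(x, W_g, W_noise, k=2, rng=rng)
```

Note that the gate weights are renormalized over only the selected experts, a common choice in sparse MoE implementations so that the combined expert outputs form a convex combination.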
Advantages:
- Computational Efficiency: Requires only a single matrix multiplication, making it very fast and adding minimal overhead compared to the expert computations themselves.
- Simplicity: Easy to implement and understand. Parameter count is relatively low ($d \times N$).
- Stability: Often easier to train compared to more complex routers, forming a reliable baseline.
Disadvantages:
- Limited Expressiveness: A linear transformation might be insufficient to capture complex conditional logic required for sophisticated expert specialization. The router can only learn linear separations in the input space for routing decisions.
- Potential for Collapse: Without proper regularization or load-balancing mechanisms (discussed in Chapter 3), linear routers can sometimes lead to representation collapse or situations where only a few experts are consistently chosen.
Non-Linear Routers
To increase the router's capacity to learn complex routing functions, non-linearities can be introduced, typically by structuring the router as a small Multi-Layer Perceptron (MLP).
Instead of a single linear layer, a non-linear router might use one or more hidden layers with activation functions (like ReLU, GeLU, or Swish). For example, a one-hidden-layer MLP router:
$$h_{\text{hidden}} = \mathrm{Activation}(x W_{g_1} + b_{g_1})$$

$$h = h_{\text{hidden}} W_{g_2} + b_{g_2}$$

Here, $W_{g_1} \in \mathbb{R}^{d \times d_{\text{hidden}}}$, $b_{g_1} \in \mathbb{R}^{d_{\text{hidden}}}$, $W_{g_2} \in \mathbb{R}^{d_{\text{hidden}} \times N}$, and $b_{g_2} \in \mathbb{R}^N$ are learned parameters. The final logits $h$ are then used for top-$k$ selection as before, potentially with added noise.
Basic data flow for a non-linear (MLP) router architecture.
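The one-hidden-layer MLP router above can be sketched as follows, again as an illustrative single-token version in NumPy with names (`mlp_router_logits`, `W_g1`, `b_g1`, etc.) chosen to match the notation rather than any specific library.

```python
import numpy as np

def mlp_router_logits(x, W_g1, b_g1, W_g2, b_g2):
    """One-hidden-layer MLP router (illustrative sketch).

    x: (d,); W_g1: (d, d_hidden); W_g2: (d_hidden, N).
    Returns routing logits h of shape (N,).
    """
    h_hidden = np.maximum(0.0, x @ W_g1 + b_g1)  # ReLU as the activation
    return h_hidden @ W_g2 + b_g2

# Toy usage: d=8, d_hidden=16, N=4 experts.
rng = np.random.default_rng(0)
d, d_hidden, N = 8, 16, 4
x = rng.standard_normal(d)
W_g1, b_g1 = rng.standard_normal((d, d_hidden)), np.zeros(d_hidden)
W_g2, b_g2 = rng.standard_normal((d_hidden, N)), np.zeros(N)
h = mlp_router_logits(x, W_g1, b_g1, W_g2, b_g2)
top2 = np.argsort(h)[-2:][::-1]  # top-k selection proceeds as with the linear router
```

Only the logit computation changes relative to the linear router; the downstream top-$k$ selection and any noise injection are unchanged.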
Advantages:
- Increased Expressiveness: Can model more complex, non-linear relationships between token representations and expert suitability. Allows for potentially more nuanced and effective specialization.
- Improved Specialization Potential: The router can learn more sophisticated decision boundaries for routing tokens.
Disadvantages:
- Higher Computational Cost: Introduces additional matrix multiplications and non-linear function evaluations, increasing the router's computational footprint.
- Increased Parameters: Requires more parameters than a linear router, especially if the hidden dimension $d_{\text{hidden}}$ is large.
- Training Complexity: Can be slightly harder to train and stabilize. The router itself might suffer from optimization challenges common to deeper networks, although typically the router MLP is kept shallow (1-2 layers).
Attention-Based Routers
A more recent and advanced approach involves incorporating attention mechanisms within the router. This allows the router to weigh different parts of the input representation or even consider contextual information when making routing decisions.
Several designs are possible:
- Self-Attention Pre-Routing: Apply a self-attention layer to the input token representation x before feeding it into a (potentially linear) routing layer. This allows the router to operate on a contextually enriched representation of the token.
- Expert-Query Attention: Use expert-specific queries to attend to the input token's representation (keys and values derived from $x$). The attention scores could directly influence the routing logits. For instance, with a learnable query vector $q_e$ for each expert $e$:

$$\alpha_e = \mathrm{Attention}(q_e, K_x, V_x)$$

where $K_x, V_x$ are projections of the input token $x$. The resulting $\alpha_e$ values (potentially after further processing) form the routing logits $h$.
Data flow for an attention-based router architecture.
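A minimal sketch of the expert-query variant is shown below, under simplifying assumptions: with a single token, the attention over values is trivial, so only the scaled query-key scores $\alpha_e$ are computed and used directly as routing logits (the value projection $V_x$ is omitted). All names (`expert_query_router`, `Q`, `W_k`) are illustrative.

```python
import numpy as np

def expert_query_router(x, Q, W_k):
    """Expert-query attention router (illustrative, single-token sketch).

    Each expert e owns a learnable query q_e (a row of Q, shape (N, d_k));
    the key is a projection of the token x. The scaled dot products
    q_e . k_x / sqrt(d_k) serve directly as routing logits alpha_e, so the
    value projection is omitted in this simplified version.
    """
    k_x = x @ W_k                    # (d_k,) key derived from the token
    d_k = k_x.shape[0]
    alpha = (Q @ k_x) / np.sqrt(d_k)  # (N,) one scaled score per expert query
    return alpha                      # used as the routing logits h

# Toy usage: d=8, key dim d_k=4, N=4 experts.
rng = np.random.default_rng(1)
d, d_k, N = 8, 4, 4
x = rng.standard_normal(d)
Q = rng.standard_normal((N, d_k))    # one learnable query vector per expert
W_k = rng.standard_normal((d, d_k))
logits = expert_query_router(x, Q, W_k)
```

Richer variants (e.g., attending over a window of tokens for context-aware routing) follow the same pattern but incur the additional cost discussed below.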
Advantages:
- Highest Expressiveness: Attention mechanisms can capture complex dependencies and contextual nuances, potentially enabling highly sophisticated and dynamic routing strategies.
- Context-Awareness: Can adapt routing based on broader context (if designed appropriately, e.g., attending over multiple tokens, though this drastically increases complexity).
Disadvantages:
- Significant Computational Cost: Attention mechanisms are computationally intensive, adding substantial overhead, especially compared to linear or simple MLP routers. This can become a bottleneck.
- Implementation and Training Complexity: Designing, implementing, and stabilizing attention-based routers is considerably more complex. They introduce more hyperparameters and potential failure modes.
- Latency: The added computation can significantly increase inference latency.
Choosing the Right Router Architecture
The selection of a router architecture involves balancing multiple factors:
| Feature | Linear Router | Non-Linear (MLP) Router | Attention-Based Router |
|---|---|---|---|
| Expressiveness | Low | Medium | High |
| Computational Cost | Low | Medium | High / Very High |
| Parameter Count | Low | Medium | High |
| Implementation | Simple | Moderate | Complex |
| Training Stability | Generally Good | Moderate | Can be Challenging |
- Start Simple: Linear routers often provide a strong baseline and are computationally cheapest. It's usually advisable to start here and only increase complexity if performance plateaus or specific routing needs arise.
- Consider the Task: More complex tasks that might benefit from highly specialized experts could potentially leverage non-linear or even attention-based routers more effectively. However, the gains must outweigh the costs.
- Budget Constraints: Available computational resources during training and inference are significant factors. Attention-based routers may be infeasible for resource-constrained environments.
- Training Dynamics: Monitor load balancing and expert specialization closely (as detailed in Chapter 3). If a simpler router leads to poor specialization or imbalance, a more expressive router might be warranted, but often, addressing auxiliary losses or regularization is a more direct solution.
In practice, simple linear routers with added noise during training remain a popular and effective choice for many large-scale MoE models due to their favorable trade-off between performance and efficiency. Non-linear routers offer a moderate step up in expressiveness when needed. Attention-based routers represent a more research-oriented direction, promising higher capability at the cost of significant complexity and computation. The optimal choice often requires empirical validation on the specific task and dataset.