Training a Mixture of Experts model with billions or even trillions of parameters pushes modern hardware to its limits. The sheer size of the weights, activations, and gradients consumes enormous amounts of GPU memory and demands immense computational throughput. While distributed training strategies partition the model, optimizing the numerical format of the data itself provides a complementary and powerful way to improve efficiency. This is where lower-precision data types, particularly BFloat16, become indispensable.
Using full 32-bit floating-point precision (FP32) for all calculations is the default in most frameworks, offering a wide dynamic range and high precision. However, each FP32 parameter requires 4 bytes of storage. For a model with 500 billion parameters, the weights alone would consume 2 terabytes of memory, making it impossible to fit on any single accelerator.
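A quick back-of-the-envelope calculation makes the scale concrete. The sketch below uses the 500-billion-parameter example from above and counts only the weights; activations, gradients, and optimizer state add substantially more on top.

# Weight memory alone, for a 500-billion-parameter model.
num_params = 500e9
for fmt, bytes_per_param in [("FP32", 4), ("BF16", 2), ("FP16", 2)]:
    terabytes = num_params * bytes_per_param / 1e12
    print(f"{fmt}: {terabytes:.1f} TB")
# FP32: 2.0 TB, BF16: 1.0 TB, FP16: 1.0 TB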
The solution is to use lower-precision formats. By reducing the number of bits used to represent each number, we can drastically cut memory usage and, with hardware support, accelerate computation.
Historically, the primary alternative to FP32 was FP16 (half-precision), which uses 16 bits. While it successfully halves memory consumption, FP16 has a significant drawback: a very limited dynamic range. Its small exponent allocation makes it susceptible to numerical instability during training. Gradients, which can have very small or very large magnitudes, can easily become zero (underflow) or infinity (overflow), destabilizing or halting the training process.
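A small PyTorch sketch illustrates how easily values fall outside FP16's range; the specific numbers are arbitrary examples.

import torch

# FP16's largest representable value is about 65504, and values below
# roughly 6e-8 are flushed to zero.
print(torch.tensor(70000.0).to(torch.float16))  # tensor(inf, dtype=torch.float16)
print(torch.tensor(1e-8).to(torch.float16))     # tensor(0., dtype=torch.float16)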
This led to the development of BFloat16 (BF16), or "Brain Floating-Point," a format designed specifically for deep learning workloads. Like FP16, it uses only 16 bits. However, it makes a different trade-off. BF16 allocates the same number of bits to the exponent as FP32, thereby preserving its wide dynamic range. This comes at the cost of the mantissa, or fractional part, which is responsible for precision.
The diagram below shows the bit allocation for these three formats.
Comparison of FP32, BF16, and FP16 bit structures. BF16 retains the 8 exponent bits of FP32, ensuring a similar dynamic range, while FP16 sacrifices exponent bits for more mantissa (precision) bits.
For deep learning, the wide dynamic range of BF16 is far more important than high precision. Neural networks are remarkably resilient to noise and lower precision in weights and activations. By preventing the underflow and overflow issues common with FP16, BF16 provides a much more stable training environment, often as a near drop-in replacement for FP32.
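You can verify these trade-offs directly with torch.finfo:

import torch

# Compare dynamic range (max, smallest normal) and precision (eps) per format.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.2e}, smallest normal={info.tiny:.2e}, eps={info.eps:.2e}")

# torch.float32:  max=3.40e+38, smallest normal=1.18e-38, eps=1.19e-07
# torch.bfloat16: max=3.39e+38, smallest normal=1.18e-38, eps=7.81e-03
# torch.float16:  max=6.55e+04, smallest normal=6.10e-05, eps=9.77e-04

Note that BF16 matches FP32's range almost exactly but has a much coarser eps, while FP16 trades most of its range for finer precision.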
While you could naively cast the entire model to BF16, a more effective technique known as mixed-precision training is standard practice. This approach combines the benefits of BF16 for speed and memory with the stability of FP32 for critical parts of the training loop.
The typical mixed-precision workflow using BF16 is as follows:
1. A master copy of the model's weights is kept in FP32 precision. This serves as the authoritative source of truth, ensuring that small gradient updates are not lost due to the lower precision of BF16.
2. At each training step, the FP32 master weights are cast down to BF16.
3. The forward and backward passes are computed with the BF16 weights and activations. Modern GPUs include specialized hardware, such as NVIDIA's Tensor Cores, that provides a significant speedup for BF16 operations.
4. The resulting gradients, computed in BF16, are then used to update the FP32 master copy of the weights.
The data flow in a standard mixed-precision training step. Computations are accelerated in BF16 while weight updates are performed in FP32 to maintain stability.
This approach offers the best of both worlds: the memory and speed benefits of 16-bit computation and the numerical stability of 32-bit weight updates. For MoE models, this is not just an optimization; it is an enabling technology. Halving the memory footprint of weights and activations makes it feasible to train larger models with more experts.
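To make the four steps above concrete, here is a minimal hand-rolled sketch that uses a single weight matrix in place of a real model; the shapes, loss, and learning rate are arbitrary illustrations, and in practice you would rely on the framework support shown next.

import torch

# Step 1: keep an authoritative FP32 master copy of the weights.
master_weight = torch.randn(1024, 1024, requires_grad=True)
optimizer = torch.optim.SGD([master_weight], lr=1e-3)
x = torch.randn(64, 1024)  # toy input batch

# Step 2: cast the master weights down to BF16 for this step.
weight_bf16 = master_weight.to(torch.bfloat16)

# Step 3: run the forward pass in BF16.
output = x.to(torch.bfloat16) @ weight_bf16
loss = output.float().pow(2).mean()  # accumulate the loss in FP32

# Step 4: gradients flow back through the cast to the FP32 master copy,
# which the optimizer then updates in full precision.
loss.backward()
optimizer.step()
optimizer.zero_grad()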
Deep learning frameworks like PyTorch provide simple context managers to automate mixed-precision training. Enabling BF16 is remarkably straightforward if your hardware supports it (e.g., NVIDIA A100 or H100 series GPUs).
Here is an example of a typical training loop using torch.autocast.
import torch

# Ensure your model and data are on a BF16-compatible device
device = "cuda" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "cpu"

model = MyMoEModel().to(device)  # your MoE model, returning (output, aux_loss)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = torch.nn.MSELoss()     # placeholder loss for illustration

data = torch.randn(64, 1024, device=device)    # example input batch
target = torch.randn(64, 1024, device=device)  # example target matching the output shape

# The torch.autocast context manager automatically handles
# casting to the specified dtype for eligible operations.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    # The model's forward pass runs in BF16
    output, aux_loss = model(data)
    # Loss computation can also be inside the autocast context
    main_loss = loss_fn(output, target)
    total_loss = main_loss + aux_loss

# Gradients are computed based on the BF16 forward pass
# The .backward() call happens outside the autocast context
total_loss.backward()

# The optimizer updates the master FP32 weights
optimizer.step()
optimizer.zero_grad()
Notice the simplicity. The autocast context manager handles the conversion of operations to BF16 automatically. When training with FP16, an additional component called a GradScaler is required to scale the loss, preventing gradients from underflowing. Because BF16 has a much larger dynamic range, this loss scaling step is often not necessary, further simplifying the training code and removing one more hyperparameter to tune. This inherent stability makes BF16 the preferred choice for training massive and complex models like MoEs.
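For comparison, training the same loop in FP16 would typically wrap the loss in a GradScaler. The sketch below reuses the model, data, loss_fn, target, and optimizer defined above.

import torch

# GradScaler multiplies the loss by a large factor so small FP16 gradients
# do not underflow, then unscales them before the optimizer step.
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    output, aux_loss = model(data)
    total_loss = loss_fn(output, target) + aux_loss

scaler.scale(total_loss).backward()
scaler.step(optimizer)   # unscales gradients, then updates the FP32 weights
scaler.update()          # adjusts the scale factor for the next iteration
optimizer.zero_grad()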