The immense parameter count of a Mixture of Experts model is its greatest strength and its most significant deployment challenge. A model like Mixtral 8x7B contains weights for eight distinct experts per MoE layer, giving it a total parameter size that far exceeds the capacity of all but the largest and most expensive GPU accelerators. During any given forward pass, however, only one or two of these experts are activated per token. This computational sparsity is what makes inference manageable: at any moment, most expert weights sit idle. Expert offloading exploits this property by storing the majority of inactive expert parameters in a larger, more economical memory pool, such as system RAM or NVMe storage.
The fundamental principle is to treat GPU VRAM as a high-speed, limited-capacity cache for expert weights. Instead of loading the entire model onto the GPU, we only load the non-expert layers (the model's backbone) and the gating networks. The expert weights themselves reside "off-chip" and are dynamically moved into VRAM on-demand.
Offloading is not a free lunch. It resolves the VRAM capacity issue at the cost of introducing data transfer latency. Moving gigabytes of expert weights from CPU RAM or an NVMe drive to the GPU over the PCIe bus takes time, an operation that is orders of magnitude slower than accessing data already in the GPU's high-bandwidth memory (HBM).
The performance of an offloaded system is therefore governed by this trade-off. The goal is to design a system that minimizes these data transfers and hides their latency as much as possible.
Data flow for an offloaded expert forward pass. When a required expert is not in the on-GPU cache, its weights must be transferred from system memory over the PCIe bus, introducing latency.
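To get a sense of the magnitude of this penalty, the back-of-the-envelope estimate below compares moving one expert's weights over PCIe with reading the same weights from HBM. The parameter count and bandwidth figures are illustrative assumptions, not measurements of any particular system.

# Back-of-the-envelope transfer estimate; all figures below are assumptions.
expert_params = 176e6        # parameters in one expert of one MoE layer (assumed)
bytes_per_param = 2          # fp16/bf16 weights
pcie_bandwidth = 25e9        # ~25 GB/s effective, roughly PCIe 4.0 x16 (assumed)
hbm_bandwidth = 2e12         # ~2 TB/s HBM read bandwidth (assumed)

expert_bytes = expert_params * bytes_per_param
pcie_ms = expert_bytes / pcie_bandwidth * 1e3
hbm_ms = expert_bytes / hbm_bandwidth * 1e3
print(f"Expert size: {expert_bytes / 1e6:.0f} MB")
print(f"PCIe transfer: {pcie_ms:.1f} ms vs. HBM read: {hbm_ms:.2f} ms")

Even with optimistic bandwidth assumptions, the PCIe transfer is roughly two orders of magnitude slower than reading the same weights that are already resident in HBM.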
Offloading expert weights to system (CPU) RAM is the most common and balanced approach. System RAM is significantly larger and cheaper than GPU VRAM, while still offering reasonable transfer speeds.
The workflow is as follows:
1. The gating network, which stays resident on the GPU, selects the experts needed for the current batch of tokens.
2. The runtime checks which of those experts are already present in VRAM.
3. Any missing expert weights are copied from system RAM to VRAM over the PCIe bus.
4. The expert computation runs on the GPU; when VRAM fills up, less recently used experts are dropped, since the master copies remain in system RAM.
Using asynchronous copy operations (e.g., torch.Tensor.to('cuda', non_blocking=True) in PyTorch) is important, as it allows the CPU to queue the next transfer while the GPU is busy with other work, partially hiding the transfer latency. Note that such copies only truly overlap with computation when the source tensors live in pinned (page-locked) host memory.
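A minimal sketch of this pattern is shown below: the host-side weights are pinned and copied into a pre-allocated GPU buffer on a dedicated CUDA stream, so the transfer can overlap with compute running on the default stream. The prefetch_expert helper and the tensor shapes are hypothetical choices for illustration.

import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for weight transfers

def prefetch_expert(cpu_weights: torch.Tensor, gpu_buffer: torch.Tensor):
    """Start an asynchronous host-to-device copy and return immediately."""
    # non_blocking copies only overlap with compute if the source is pinned.
    assert cpu_weights.is_pinned()
    with torch.cuda.stream(copy_stream):
        gpu_buffer.copy_(cpu_weights, non_blocking=True)

# Example usage with illustrative shapes:
cpu_weights = torch.randn(14336, 4096, dtype=torch.float16).pin_memory()
gpu_buffer = torch.empty_like(cpu_weights, device='cuda')
prefetch_expert(cpu_weights, gpu_buffer)
# ... the GPU keeps executing other layers on the default stream ...
# Before using the expert, make the default stream wait for the copy:
torch.cuda.current_stream().wait_stream(copy_stream)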
For extremely large models that may not even fit into system RAM, or on hardware with limited RAM, offloading can be extended to high-speed NVMe SSDs. This provides access to terabytes of storage but comes with a severe latency penalty.
The data path becomes longer: NVMe -> CPU RAM -> GPU VRAM. While technologies like GPUDirect Storage can create a more direct path from NVMe to GPU, they add system complexity. This strategy is typically reserved for offline, throughput-oriented batch processing jobs where per-token latency is not the primary concern.
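Below is a minimal sketch of this two-hop path, assuming each expert is serialized to its own .pt file on the NVMe drive (a hypothetical layout chosen only for illustration).

import torch

def load_expert_from_nvme(path: str) -> torch.nn.Module:
    """NVMe -> CPU RAM -> GPU VRAM, the two-hop path described above."""
    # Hop 1: read the serialized expert from NVMe into system RAM.
    expert = torch.load(path, map_location='cpu')
    # Pin the weights so the second hop can run asynchronously.
    for param in expert.parameters():
        param.data = param.data.pin_memory()
    # Hop 2: copy from pinned CPU RAM into GPU VRAM over PCIe.
    return expert.to('cuda', non_blocking=True)

# expert = load_expert_from_nvme('/mnt/nvme/experts/layer_00_expert_03.pt')  # hypothetical path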
The latency penalty of offloading can be substantially reduced by implementing a cache for expert weights in GPU VRAM. A simple Least Recently Used (LRU) caching policy is highly effective. The GPU allocates a portion of its VRAM to hold a small number of experts.
When an expert is required:
1. Cache hit: the expert's weights are already in VRAM, so computation proceeds immediately and the expert is marked as most recently used.
2. Cache miss: the weights must be transferred from CPU RAM; if the cache is full, the least recently used expert is evicted first to free space.
The size of this cache is a critical hyperparameter. A larger cache increases the probability of a cache hit, reducing average latency, but it also consumes more of the precious GPU VRAM that could be used for larger batches.
Impact of on-GPU expert caching on inference latency. Even a small cache that holds a fraction of the total experts can dramatically reduce the average latency by avoiding frequent data transfers over the PCIe bus.
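The effect of the hit rate can be estimated directly: a hit costs roughly an HBM read, while a miss pays the full PCIe transfer, so the expected per-access latency is a weighted average of the two. The cost figures below are assumptions consistent with the earlier transfer estimate.

# Expected per-expert-access latency as a function of cache hit rate.
miss_latency_ms = 14.0   # PCIe transfer on a miss (assumed, see earlier estimate)
hit_latency_ms = 0.2     # weights already resident in HBM (assumed)

for hit_rate in (0.0, 0.5, 0.8, 0.95):
    expected_ms = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
    print(f"hit rate {hit_rate:.0%}: ~{expected_ms:.1f} ms per expert access")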
To make this concrete, consider a simplified OffloadedMoE layer in Python. This example shows the core logic of checking a cache and loading an expert on a miss.
import copy

import torch


class OffloadedMoE(torch.nn.Module):
    def __init__(self, experts, cache_size=4):
        super().__init__()
        # Experts are initially on CPU; this ModuleList is the master copy.
        self.experts_cpu = torch.nn.ModuleList(experts)
        self.num_experts = len(experts)
        # GPU cache management
        self.cache_size = cache_size
        self.expert_cache_gpu = {}  # Maps expert_id to a GPU-resident copy
        self.expert_cache_lru = []  # Stores expert_ids in usage order (oldest first)

    def _load_expert_to_gpu(self, expert_id):
        # Evict the least recently used expert if the cache is full
        if len(self.expert_cache_lru) >= self.cache_size:
            evict_id = self.expert_cache_lru.pop(0)
            del self.expert_cache_gpu[evict_id]
        # Copy the expert so the CPU master stays in system RAM
        # (nn.Module.to() moves a module in place rather than returning a new one).
        expert_gpu = copy.deepcopy(self.experts_cpu[expert_id])
        # non_blocking only overlaps with compute when the source memory is pinned.
        self.expert_cache_gpu[expert_id] = expert_gpu.to('cuda', non_blocking=True)
        self.expert_cache_lru.append(expert_id)

    def forward(self, x, gating_output):
        # gating_output contains router decisions, e.g., expert indices
        required_ids = torch.unique(gating_output.top_k_indices).tolist()
        # Ensure all required experts are resident in the GPU cache
        for expert_id in required_ids:
            if expert_id not in self.expert_cache_gpu:
                self._load_expert_to_gpu(expert_id)
            else:
                # Move to end of LRU list to mark as recently used
                self.expert_cache_lru.remove(expert_id)
                self.expert_cache_lru.append(expert_id)
        # ... logic to dispatch tokens to the correct experts on GPU ...
        # final_output = perform_expert_computation(x, self.expert_cache_gpu)
        # return final_output
This example omits the complex token dispatch logic but shows the caching mechanism. A production system like deepspeed-mii or Hugging Face's accelerate provides optimized implementations of this pattern. By combining intelligent caching with asynchronous transfers, expert offloading makes it practical to run massive MoE models on commodity hardware, democratizing access to their capabilities.
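As a quick sanity check of the caching behavior, the hypothetical snippet below builds an OffloadedMoE from small placeholder experts and drives it with a fabricated router output; SimpleNamespace stands in for whatever gating object a real router would produce, and a CUDA device is assumed.

from types import SimpleNamespace

import torch

# Eight tiny placeholder experts; a real MoE layer would use full FFN blocks.
experts = [torch.nn.Linear(64, 64) for _ in range(8)]
moe = OffloadedMoE(experts, cache_size=2)

x = torch.randn(4, 64)
# Fabricated router decisions: the batch requires experts 1, 3, and 5.
gating_output = SimpleNamespace(top_k_indices=torch.tensor([[1, 3], [3, 5], [1, 5], [3, 1]]))

moe(x, gating_output)  # dispatch is omitted above, but the cache is populated
print(sorted(moe.expert_cache_gpu.keys()))  # the two most recently used experts remain
print(moe.expert_cache_lru)                 # eviction order after the forward pass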