Quantization addresses the challenge of fitting massive MoE models into limited GPU memory by shrinking their weights. It is a set of methods for reducing the numerical precision of a model's parameters and, in some cases, its activations. By representing weights with fewer bits, for example, 8-bit integers (INT8) or 4-bit integers (INT4) instead of 32-bit or 16-bit floating-point numbers, this technique can dramatically decrease a model's memory footprint and often accelerate computation on supported hardware.
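To get a sense of scale, here is a quick back-of-the-envelope calculation of weight storage at different precisions, using an approximate 47-billion-parameter count in the spirit of Mixtral-8x7B. It counts weights only; activations, the KV cache, and framework overhead come on top.

# Approximate weight-storage footprint of a ~47B-parameter MoE model
# at different precisions (weights only).
num_params = 47e9

bytes_per_param = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>10}: {num_params * nbytes / 2**30:6.1f} GiB")
# FP32 ~175 GiB, FP16/BF16 ~88 GiB, INT8 ~44 GiB, INT4 ~22 GiB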
However, applying quantization to an MoE architecture is not as simple as uniformly reducing precision across the entire model. The unique structure, with its sensitive gating network and independent experts, demands a more careful approach to preserve model performance.
The core idea for effective MoE quantization is to apply different precision levels to different components based on their sensitivity to numerical error. An MoE layer can be broken down into two main parts: the gating network and the pool of experts.
The experts, which are typically Feed-Forward Networks (FFNs), contain the overwhelming majority of an MoE model's parameters. A model like Mixtral-8x7B, for instance, holds most of its roughly 47 billion total parameters in the eight experts of each layer. This makes them the primary target for aggressive quantization.
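A rough per-layer count makes the imbalance concrete. The sketch below assumes Mixtral-like dimensions (hidden size 4096, expert FFN size 14336, 8 experts, three weight matrices per SwiGLU expert); treat the exact numbers as illustrative.

# Rough parameter split for one Mixtral-like MoE layer.
# Assumed dimensions: hidden size 4096, expert FFN size 14336, 8 experts,
# 3 weight matrices per expert (SwiGLU gate/up/down projections).
hidden = 4096
ffn = 14336
num_experts = 8

router_params = hidden * num_experts            # one small linear layer
expert_params = num_experts * 3 * hidden * ffn  # the expert FFNs

print(f"router params per layer: {router_params:,}")   # 32,768
print(f"expert params per layer: {expert_params:,}")   # ~1.4 billion
print(f"router share: {router_params / (router_params + expert_params):.4%}")  # ~0.002%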
Since each expert functions as a standard FFN, we can apply modern post-training quantization (PTQ) techniques like GPTQ or AWQ on a per-expert basis. These methods are effective because they analyze the weight distributions and activation patterns to minimize the error introduced by quantization.
Common quantization formats for experts include:
INT8: 8-bit integer weights, roughly halving memory relative to FP16/BF16 with minimal quality loss.
INT4: 4-bit integer weights for roughly a 4x reduction, usually with small per-group scaling factors.
NF4 (NormalFloat4): a 4-bit data type designed for normally distributed weights, used by bitsandbytes.
GPTQ and AWQ: calibration-based 4-bit schemes that choose quantization parameters to minimize output error.
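GPTQ and AWQ involve calibration data and more sophisticated error minimization, so the snippet below is only a minimal stand-in: simple symmetric (absmax) INT8 quantization applied independently to each expert's weight matrix, which is enough to show that every expert gets its own quantization parameters. The matrix sizes are deliberately small.

import torch

def absmax_int8(weight: torch.Tensor):
    # Symmetric INT8 quantization with one scale per output channel.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

# Stand-ins for the down-projection weights of three independent experts.
experts = [torch.randn(512, 2048) * 0.02 for _ in range(3)]

for i, w in enumerate(experts):
    q, scale = absmax_int8(w)          # each expert is quantized on its own
    w_hat = q.float() * scale          # de-quantized reconstruction
    rel_err = (w - w_hat).abs().mean() / w.abs().mean()
    print(f"expert {i}: relative reconstruction error = {rel_err:.4f}")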
The process involves quantizing the weights of each expert FFN while often leaving the activations in a higher-precision format like BF16 or FP16. During the forward pass, the INT4/INT8 weights are de-quantized on the fly just before the matrix multiplication. The extra de-quantization work is usually a good trade: inference is typically limited by memory bandwidth, so transferring the much smaller quantized weights from VRAM to the GPU's compute units saves more time than the de-quantization costs.
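The snippet below sketches that forward-pass step using the toy INT8 scheme from the previous example. Real kernels, such as those in bitsandbytes, fuse the de-quantization and the matmul, but the precision boundaries are the same.

import torch

# Assume q_weight (INT8) and scale were produced per expert as in the previous sketch.
q_weight = torch.randint(-127, 128, (2048, 512), dtype=torch.int8)
scale = torch.full((2048, 1), 0.01, dtype=torch.bfloat16)

x = torch.randn(4, 512, dtype=torch.bfloat16)   # activations stay in BF16

w = q_weight.to(torch.bfloat16) * scale         # de-quantize on the fly
y = x @ w.t()                                   # the matmul runs in BF16
print(y.shape, y.dtype)                         # torch.Size([4, 2048]) torch.bfloat16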
A diagram of the relative memory footprint of expert weights under different quantization schemes. Moving from 16-bit floating point to 4-bit integers yields a 4x reduction in storage requirements.
The gating network, or router, is a different story. Its function is to calculate logits that determine which experts process each token. The routing decision is a discrete, high-consequence operation. A small perturbation in the router's output logits, potentially caused by quantization noise, could cause a token to be sent to a completely different and inappropriate expert. This can have a much larger negative impact on the final output than a small precision error inside a single expert's FFN.
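A tiny numeric illustration with made-up logits shows how fragile this is: when two experts are nearly tied, a perturbation of the size quantization noise can plausibly introduce is enough to change the top-k selection.

import torch

# Hypothetical router logits for one token over 8 experts.
# Experts 1 and 3 are nearly tied, which happens often in practice.
logits = torch.tensor([0.10, 1.52, -0.30, 1.50, 0.05, -1.20, 0.40, -0.70])
print(torch.topk(logits, k=2).indices.tolist())      # [1, 3]

# A small perturbation of the kind quantization noise can introduce
# changes which expert is ranked first.
perturbed = logits + torch.tensor([0.0, -0.03, 0.0, 0.02, 0.0, 0.0, 0.0, 0.0])
print(torch.topk(perturbed, k=2).indices.tolist())   # [3, 1]

# With top-2 routing this changes the mixing weights; with top-1 routing
# the token would be processed by a different expert entirely.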
Because of this sensitivity, it is a standard best practice to avoid quantizing the gating network. The router's parameters are a tiny fraction of the total model size, so keeping them in a higher-precision format like FP16 or BFloat16 imposes a negligible memory cost but provides significant stability for the routing mechanism.
The most effective strategy is therefore a mixed-precision approach. You apply aggressive quantization where it provides the most benefit (the experts) and maintain high precision where it is most needed (the router).
A typical mixed-precision configuration for an MoE Transformer layer during inference looks like this:
Gating network (router): weights and computation kept in FP16 or BF16.
Expert FFN weights: quantized to INT4 or INT8, expert by expert.
Activations and compute: BF16 or FP16, with expert weights de-quantized on the fly before each matrix multiplication.
This strategy confines the low-precision representation to the parameters that are "at rest" in memory, reaping the memory savings, while performing the actual computations in a more stable 16-bit format.
Flow within a mixed-precision MoE layer. The gating network operates at high precision to ensure stable routing, while the large expert networks use quantized weights to save memory.
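The sketch below wires those pieces together in a toy module: the router runs as a BF16 nn.Linear, the expert weights are stored with the simple INT8 scheme from earlier, and de-quantization happens inside each expert's forward pass. It is illustrative only; production implementations use fused kernels and batched expert dispatch.

import torch
import torch.nn as nn
import torch.nn.functional as F

def absmax_int8(w: torch.Tensor):
    # Toy per-channel INT8 quantization (stand-in for NF4/GPTQ/AWQ).
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.to(torch.bfloat16)

class ToyMixedPrecisionMoE(nn.Module):
    def __init__(self, d_model=256, d_ffn=512, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router stays in BF16: tiny, but sensitive to noise.
        self.router = nn.Linear(d_model, n_experts, bias=False).to(torch.bfloat16)
        # Expert weights (the bulk of the memory) are stored quantized.
        self.experts = []
        for _ in range(n_experts):
            up = absmax_int8(torch.randn(d_ffn, d_model) * 0.02)
            down = absmax_int8(torch.randn(d_model, d_ffn) * 0.02)
            self.experts.append((up, down))

    def run_expert(self, idx, x):
        (q_up, s_up), (q_down, s_down) = self.experts[idx]
        up = q_up.to(torch.bfloat16) * s_up        # de-quantize on the fly
        down = q_down.to(torch.bfloat16) * s_down
        return F.gelu(x @ up.t()) @ down.t()       # compute in BF16

    def forward(self, x):                          # x: (tokens, d_model), BF16
        probs = F.softmax(self.router(x), dim=-1)  # high-precision routing
        weights, chosen = torch.topk(probs, self.top_k)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                # naive per-token dispatch
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.run_expert(int(e), x[t])
        return out

layer = ToyMixedPrecisionMoE()
tokens = torch.randn(8, 256, dtype=torch.bfloat16)
print(layer(tokens).shape)                         # torch.Size([8, 256])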
Modern libraries like Hugging Face transformers, in combination with bitsandbytes, make implementing this mixed-precision strategy straightforward. You specify a quantization configuration that is applied to the model's linear layers, and you keep sensitive components, such as the output head or, in this case, the router, in higher precision by listing their module names in the configuration (for example, via the llm_int8_skip_modules argument).
Here is an example of how you might configure a 4-bit quantization scheme when loading a model.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Define the 4-bit quantization configuration.
# It is applied to the linear layers inside each expert.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Keep the router in high precision; "gate" is the router's
    # module name in Mixtral-style models.
    llm_int8_skip_modules=["gate"],
)

# When loading the model, pass this configuration. The library replaces
# torch.nn.Linear layers with quantized versions according to the config,
# which for MoE models targets the experts. Note that this downloads a
# very large checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto",
)
In this configuration:
load_in_4bit=True: Enables loading the weights in 4-bit precision.
bnb_4bit_quant_type="nf4": Specifies the "NormalFloat4" data type, which is optimized for the normally distributed weights found in neural networks.
bnb_4bit_use_double_quant=True: Applies a second quantization to the quantization constants themselves, saving roughly an additional 0.4 bits per parameter.
bnb_4bit_compute_dtype=torch.bfloat16: Sets the computation data type; the 4-bit weights are de-quantized to BFloat16 before each matrix multiplication.
llm_int8_skip_modules=["gate"]: Leaves the listed modules un-quantized, here the router, so that routing decisions stay in high precision.

By carefully applying quantization, you can achieve substantial memory savings, making it possible to run extremely large MoE models on consumer-grade or single-server GPUs. This technique, when combined with others like expert offloading and speculative decoding, forms a powerful toolkit for making sparse models practical for production inference.