Scaling deep learning models traditionally involves making every component larger, which increases computational costs significantly. Mixture of Experts (MoE) models present a different approach. They increase model capacity by incorporating a collection of specialized sub-networks, known as experts, and dynamically route each input token to a small subset of them. This allows for a massive increase in the number of parameters without a proportional increase in the required computation for a single forward pass.
This chapter establishes the core principles of these sparse architectures. We will start by dissecting the main components: the gating network, which learns how to route inputs, and the expert networks, which develop specialized functions. We will then examine the complete mathematical formulation of an MoE layer, whose output y(x) for an input x is a sparse, weighted sum of expert outputs:
$$
y(x) = \sum_{i=1}^{N} g(x)_i \, E_i(x)
$$

In this formulation, $g(x)_i$ is the gating value for expert $i$ and $E_i(x)$ is that expert's output. You will learn how sparsity is enforced, typically by ensuring that most $g(x)_i$ values are zero for any given input.
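To make the formulation concrete, the sketch below computes this sparse, weighted sum with top-k gating in PyTorch. It is a minimal illustration, not the chapter's reference implementation: the class name `SparseMoE`, the choice of feed-forward experts, and the `num_experts` and `top_k` defaults are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative sparse MoE layer: y(x) = sum_i g(x)_i * E_i(x)."""
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: produces one routing logit per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert networks: simple position-wise feed-forward blocks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.gate(x)                   # (num_tokens, num_experts)
        # Keep only the top-k gating values per token; all others are zero.
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # renormalize over selected experts
        y = torch.zeros_like(x)
        for k in range(self.top_k):
            idx = topk_idx[:, k]                # expert chosen at rank k per token
            w = weights[:, k].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Each token is processed only by the experts it was routed to.
                    y[mask] += w[mask] * expert(x[mask])
        return y
```

Even with eight experts, each token activates only two of them, so the per-token compute stays close to that of a single feed-forward block while the parameter count grows with the number of experts.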
Training these models comes with unique considerations. We will cover the auxiliary loss functions used for load balancing, which encourage an even distribution of tokens across all experts. This is a necessary technique for preventing common training instabilities like expert collapse, where a few experts are over-utilized while others receive little or no training signal. To conclude the chapter, we will apply these ideas by building a basic MoE layer from scratch.
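As a preview of that discussion, here is a minimal sketch of one widely used load-balancing loss, which multiplies the fraction of tokens dispatched to each expert by the expert's mean gate probability. The function name, the assumption of hard top-1 routing statistics, and the scaling coefficient mentioned afterwards are illustrative choices for this example, not values fixed by the chapter.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, expert_indices, num_experts):
    """Auxiliary loss encouraging an even spread of tokens across experts.

    gate_logits:    (num_tokens, num_experts) raw gating scores
    expert_indices: (num_tokens,) expert each token was routed to
    """
    probs = F.softmax(gate_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i (hard assignment).
    f = torch.bincount(expert_indices, minlength=num_experts).float()
    f = f / expert_indices.numel()
    # P_i: mean gate probability assigned to expert i (soft assignment).
    p = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1 / num_experts).
    return num_experts * torch.sum(f * p)
```

In practice this term is scaled by a small coefficient and added to the main task loss, so the router is nudged toward balanced assignments without overriding the task objective.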
1.1 Overview of Sparsely-Gated MoE Architecture
1.2 The Gating Network: Formulation and Function
1.3 Expert Networks: Specialization and Capacity
1.4 Mathematical Formulation of the MoE Layer
1.5 Load Balancing and Auxiliary Losses
1.6 Challenges in MoE Training: Expert Collapse
1.7 Comparison with Dense Model Scaling
1.8 Hands-on: Implementing a Basic MoE Layer