This chapter lays the groundwork for understanding Mixture of Experts (MoE) models. We begin by examining the principles of conditional computation, the core idea enabling sparse models to scale efficiently. You will learn how MoEs differ from standard dense networks by selectively activating only a fraction of their parameters, organized into 'experts', for each input.
We will define the basic structure of an MoE layer and compare the computational characteristics of sparse versus dense activation patterns. The chapter concludes by presenting the mathematical formulation for a standard MoE layer, detailing how input tokens are routed via a gating network $G(x)$ to specific expert networks $E_i(x)$ and how their outputs are combined. This combination is often represented as:

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$

where $G(x)_i$ represents the gating decision or weight for expert $i$, and $N$ is the total number of experts. Understanding these fundamentals is necessary before tackling the advanced MoE architectures and training procedures in subsequent chapters.
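To make this formulation concrete, the following is a minimal sketch of a densely gated MoE layer, assuming PyTorch; the class and parameter names (`SimpleMoELayer`, `num_experts`, `d_hidden`) are illustrative, not from any specific library. It computes the weighted sum over all experts directly; a sparse MoE would evaluate only the top-k experts selected by the gate, which later sections address.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    """Dense-gated MoE layer computing y = sum_i G(x)_i * E_i(x)."""

    def __init__(self, d_model: int, num_experts: int, d_hidden: int):
        super().__init__()
        # Gating network G(x): one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert networks E_i(x): independent small feed-forward blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); gate weights sum to 1 across experts.
        gate_weights = F.softmax(self.gate(x), dim=-1)          # (batch, N)
        expert_outputs = torch.stack(
            [expert(x) for expert in self.experts], dim=1
        )                                                        # (batch, N, d_model)
        # Weighted combination over experts: y = sum_i G(x)_i * E_i(x)
        return torch.einsum("bn,bnd->bd", gate_weights, expert_outputs)


# Example usage with arbitrary illustrative dimensions.
layer = SimpleMoELayer(d_model=16, num_experts=4, d_hidden=32)
y = layer(torch.randn(8, 16))
print(y.shape)  # torch.Size([8, 16])
```

Note that this dense version runs every expert on every input, so it gains no compute savings; the efficiency of sparse MoEs comes from restricting the sum to the few experts the gate selects per token.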
1.1 Conditional Computation Principles
1.2 The Sparse MoE Paradigm
1.3 Contrasting Dense vs. Sparse Activation
1.4 Mathematical Formulation of Basic MoE Layers