The Sparse Mixture of Experts (MoE) paradigm offers a structured approach to implementing conditional computation, allowing neural networks to scale their parameter count significantly without a proportional increase in computational cost per input. Building upon the general idea of activating only relevant parts of a network, the MoE architecture provides a concrete mechanism for achieving this sparsity, particularly within the context of large transformer models.
At its core, an MoE layer replaces a standard component, such as the Feed-Forward Network (FFN) block in a transformer, with two main parts:

1. A set of N expert networks, typically FFNs that share an architecture but have independent parameters.
2. A gating network, or router, that decides which experts process each input token and how their outputs are weighted.
The sparsity arises from the router's selection mechanism. Instead of sending every token to every expert (which would be computationally expensive, resembling an ensemble), the router typically employs a sparse selection strategy. The most common approach is top-k routing, where for each input token, the router calculates an affinity score for each of the N experts. It then selects only the k experts with the highest scores to process the token, where k is a small integer, often 1 or 2, and significantly less than N (k≪N).
A conceptual view of a Sparse MoE layer. An input token is processed by the gating network, which selects and weights a small subset of experts (Experts 1 and k in this example, indicated by solid green lines). Inactive experts (dashed gray lines) are bypassed for this token. The outputs of the active experts are combined to produce the final output.
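The routing step described above can be sketched in a few lines of code. The snippet below, written in PyTorch for illustration, assumes a simple linear router over token representations; the dimensions and variable names are placeholders rather than a prescribed implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative top-k routing (hypothetical sizes: d_model, N experts).
d_model, num_experts, k = 512, 8, 2
router = torch.nn.Linear(d_model, num_experts)  # one affinity score per expert

tokens = torch.randn(4, d_model)                # 4 token representations
logits = router(tokens)                         # shape: (4, num_experts)

# Keep only the k highest-scoring experts per token.
topk_scores, topk_indices = logits.topk(k, dim=-1)

# Normalize the selected scores so each token's k gate weights sum to 1.
gate_weights = F.softmax(topk_scores, dim=-1)

print(topk_indices)   # which k of the N experts each token is routed to
print(gate_weights)   # the weight applied to each chosen expert's output
```

In this sketch the softmax is applied only over the k retained scores; some designs instead apply the softmax over all N scores before selecting the top k, which changes the gate values but not the sparsity pattern.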
For a specific token x, only the chosen top-k experts perform computations ($E_i(x)$ for i in the selected set). The remaining N−k experts are inactive for this token, contributing zero computational load (FLOPs) to its processing. This allows the total parameter count of the model (sum of parameters in the router and all N experts) to be very large, while the computational cost per token remains controlled, scaling only with k and the size of a single expert, not with the total number of experts N.
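A quick back-of-envelope calculation, using purely hypothetical layer sizes, illustrates this decoupling of total parameters from per-token compute:

```python
# Hypothetical sizes chosen only to illustrate the scaling argument.
d_model, d_ff = 4096, 16384        # one expert ~ a standard two-matrix FFN block
num_experts, k = 64, 2

expert_params = 2 * d_model * d_ff                 # parameters in one expert
total_expert_params = num_experts * expert_params  # grows with N
active_params_per_token = k * expert_params        # grows only with k

print(f"total expert parameters : {total_expert_params / 1e9:.1f}B")     # ~8.6B
print(f"active per token        : {active_params_per_token / 1e9:.2f}B") # ~0.27B
```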
The final output y of the MoE layer for token x is typically a weighted combination of the outputs from the activated experts. The weights are also determined by the gating network's scores. Recall the general formulation:
$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$
In the sparse top-k scenario, $G(x)_i$ is non-zero only for the k selected experts, effectively making the sum sparse:

$$y = \sum_{i \in \mathrm{TopK}(G(x))} G(x)_i \, E_i(x)$$

Here, $\mathrm{TopK}(G(x))$ represents the indices of the k experts selected by the router G for input x, and $G(x)_i$ is the learned weight assigned to expert i's output.
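Putting the routing and the weighted combination together, a minimal sketch of a sparse MoE layer might look as follows. This is an illustration under assumed choices (PyTorch, FFN experts, a linear router, softmax over the selected scores); all names and sizes are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: a sketch, not a production implementation."""
    def __init__(self, d_model, d_ff, num_experts, k):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)          # gating network G
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)                         # experts E_1 .. E_N
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.router(x)                 # affinity per (token, expert)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        gates = F.softmax(topk_scores, dim=-1)  # G(x)_i for the selected experts

        y = torch.zeros_like(x)
        for slot in range(self.k):              # each of the k routing slots
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e   # tokens whose slot-th pick is expert e
                if mask.any():
                    y[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y                                # sparse weighted sum of expert outputs

layer = SparseMoELayer(d_model=64, d_ff=256, num_experts=8, k=2)
out = layer(torch.randn(10, 64))                # 10 tokens -> output of shape (10, 64)
```

The nested Python loops are only for clarity; efficient implementations gather the tokens assigned to each expert and process them with batched matrix multiplications.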
This sparse MoE paradigm presents a powerful mechanism for decoupling model size (parameter count) from computational cost (FLOPs per token). It enables the construction of models with trillions of parameters while maintaining a manageable computational budget during training and inference. However, realizing the potential benefits requires addressing specific challenges inherent to this architecture, such as ensuring balanced utilization of experts and stable training dynamics for the router, which are central topics in subsequent chapters.