While techniques like quantization and pruning reduce the computational or memory cost of existing model parameters, conditional computation fundamentally changes which parameters are used for a given input. The goal is to activate only a subset of the model's parameters tailored to the specific input, thereby reducing the floating-point operations (FLOPs) required per inference step, even if the total parameter count remains large or even increases. This contrasts sharply with standard dense models where nearly all parameters participate in processing every single input token.
Mixture-of-Experts (MoE) is the most prominent and successful architecture embodying conditional computation, particularly within the feed-forward network (FFN) layers of Transformer models.
An MoE layer replaces a standard dense FFN layer with several components:
- A set of M expert networks, each typically a standard FFN with its own parameters.
- A gating (router) network that scores each expert for a given token representation and decides which experts process that token.
A conceptual view of an MoE layer. The gating network routes the input token representation to a subset of experts (e.g., Expert 1 and Expert M). Their outputs are combined based on the gating scores to produce the final output representation.
The most common routing strategy is Top-k routing, where the gating network outputs a score for each expert, and the k experts with the highest scores are selected to process the token. Often, k is small (e.g., k=1 as in Switch Transformers, or k=2). The final output is the weighted sum of the outputs of these top-k experts, where the weights are derived from the normalized gating scores.
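To make this concrete, the following is a minimal sketch of a top-k MoE layer in PyTorch. The class and argument names (MoELayer, num_experts, top_k) are illustrative rather than taken from any particular library, and the per-expert loop favors readability over the batched token dispatch used in production systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k MoE layer: a gate plus a set of FFN experts."""

    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small FFN per expert; total parameters grow with num_experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Gating network: one score per expert for each token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                   # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        # Normalize the selected scores so expert outputs form a weighted combination.
        weights = F.softmax(topk_scores, dim=-1)                # (num_tokens, top_k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                             # expert chosen in this slot
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only tokens routed to expert e are processed by it.
                    out[mask] += w[mask] * expert(x[mask])
        return out

# Usage: route a batch of token representations through the layer.
layer = MoELayer(d_model=512, d_ff=2048, num_experts=8, top_k=2)
y = layer(torch.randn(16, 512))
```

Each token passes through the gate, its top-k experts are selected, and their outputs are combined using the normalized gating scores, matching the description above.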
MoE achieves computational efficiency through sparse activation. While the total number of parameters across all experts can be substantial (leading to high model capacity), only the parameters within the selected k experts and the gating network are activated for any given token.
Consider an FFN layer with model dimension $d_{\text{model}}$ and hidden dimension $d_{\text{ff}}$. Its FLOPs per token are approximately $2 \times d_{\text{model}} \times d_{\text{ff}}$. In an MoE layer with $M$ experts and top-$k$ routing ($k \ll M$), the FLOPs per token are roughly:

$$\text{FLOPs}_{\text{MoE}} \approx \text{FLOPs}_{\text{Router}} + k \times (2 \times d_{\text{model}} \times d_{\text{ff}})$$

Since the router is small and $k$ is typically small, the computational cost per token is significantly lower than that of a dense model with an equivalent total number of parameters ($M \times 2 \times d_{\text{model}} \times d_{\text{ff}}$ FLOPs). MoE effectively decouples the parameter count from the per-token computational cost.
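The arithmetic below plugs illustrative numbers into these expressions; the dimensions and expert counts are assumptions chosen only to show the scale of the gap.

```python
# Illustrative FLOPs-per-token arithmetic; all sizes here are assumptions.
d_model, d_ff = 4096, 16384   # FFN dimensions
M, k = 64, 2                  # total experts and experts activated per token

dense_ffn_flops = 2 * d_model * d_ff            # one dense FFN, per the approximation above
router_flops = 2 * d_model * M                  # gating projection: d_model -> M scores
moe_flops = router_flops + k * dense_ffn_flops  # per-token MoE cost

equivalent_dense_flops = M * dense_ffn_flops    # if all expert parameters ran densely

print(f"dense FFN:              {dense_ffn_flops / 1e6:.0f} MFLOPs/token")
print(f"MoE (M={M}, k={k}):     {moe_flops / 1e6:.0f} MFLOPs/token")
print(f"all experts, dense run: {equivalent_dense_flops / 1e6:.0f} MFLOPs/token")
```

With these numbers, the MoE layer costs roughly twice a single dense FFN per token while holding 64 times the expert parameters.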
Comparison of how inference FLOPs per token scale with total model parameters for dense vs. MoE models. MoE allows much larger parameter counts for a slower increase in per-token FLOPs. (Values are illustrative).
Training MoE models presents unique challenges:
- Load balancing: left unchecked, the router tends to favor a few experts, leaving the rest under-trained. Auxiliary losses are commonly added to encourage a uniform spread of tokens across experts (a sketch follows this list).
- Training instability: the discrete routing decisions can make optimization noisier than in dense models.
- System overhead: all experts must be held in memory, and distributing them across devices (expert parallelism) adds communication costs for dispatching tokens.
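As one illustration of the load-balancing point, below is a minimal sketch of an auxiliary loss in the style popularized by Switch Transformers, which penalizes routing distributions that concentrate tokens on a few experts. The function name is illustrative, and in practice this term is scaled by a small coefficient and added to the main training loss.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_idx, num_experts):
    """Encourages tokens to be spread evenly across experts.

    router_logits: (num_tokens, num_experts) raw gate scores
    top1_idx:      (num_tokens,) index of the expert each token was routed to
    """
    probs = F.softmax(router_logits, dim=-1)                 # router probability per expert
    # Fraction of tokens actually dispatched to each expert.
    dispatch_frac = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # Mean router probability assigned to each expert.
    prob_frac = probs.mean(dim=0)
    # Minimized when both fractions are uniform (1 / num_experts each).
    return num_experts * torch.sum(dispatch_frac * prob_frac)
```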
MoE is fundamentally an architectural approach to efficiency. It can be combined with other techniques:
- Quantization can compress the weights of each expert, shrinking the memory footprint of the large total parameter count.
- Pruning can remove redundant weights within experts or drop rarely used experts entirely.
- Knowledge distillation can transfer a trained MoE model's capabilities into a smaller dense model for deployment.
However, MoE primarily tackles the FLOPs-per-token aspect by increasing parameter count while sparsifying activation. This makes it distinct from methods that reduce the cost of existing dense computations.
While top-k routing for FFNs is the most common form, conditional computation ideas extend further. Switch Transformers popularized the extreme k=1 case, routing each token to only a single expert. Research also explores conditional computation in other parts of the Transformer, like attention mechanisms, although MoE for FFN layers remains the most widely adopted technique.
Choosing MoE involves trade-offs:
- Per-token compute drops, but the total memory footprint grows because every expert's parameters must be stored and served (see the accounting sketch below).
- Serving requires more complex infrastructure, such as expert parallelism and efficient token dispatch across devices.
- Training is more involved due to the load-balancing and stability concerns discussed above.
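A quick parameter count makes the memory side of the trade-off concrete; the sizes below are the same illustrative assumptions used earlier, ignoring biases and the small gating network.

```python
# Back-of-the-envelope parameter accounting for the memory trade-off; sizes are illustrative.
d_model, d_ff = 4096, 16384
M, k = 64, 2

params_per_expert = 2 * d_model * d_ff   # two weight matrices per FFN expert
total_params = M * params_per_expert     # must all be held in memory when serving
active_params = k * params_per_expert    # touched per token

print(f"total expert params:     {total_params / 1e9:.1f} B")
print(f"active params per token: {active_params / 1e9:.2f} B")
```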
In summary, MoE represents a significant architectural shift towards efficient scaling of LLMs. By activating only a fraction of its parameters per input token, an MoE model can grow to a much larger total parameter count without a proportional increase in per-token computational cost. This offers a path towards more capable models under constrained inference budgets, albeit with substantial system-level and training challenges. Understanding MoE is increasingly important as it forms the backbone of several state-of-the-art large language models.