As large neural networks, particularly Transformers, continue to grow in parameter count, the computational cost associated with training and inference becomes a significant bottleneck. Standard dense architectures require every parameter to participate in the computation for every input token. This coupling between model size (parameter count) and computational load (FLOPs) limits practical scalability. If we double the parameters in a dense layer, we often double the computation for each input passing through it.
Conditional computation offers an alternative approach. The fundamental idea is to activate only a subset of the model's parameters for any given input, based on the input itself. Instead of processing information through a monolithic block of computation, the network dynamically selects specialized computational pathways.
Imagine a vast network containing numerous specialized subnetworks or 'experts'. For a specific input token (or sequence), perhaps only a few of these experts possess the relevant knowledge or function to process it effectively. Conditional computation allows the model to identify and invoke only these relevant experts, leaving the others inactive.
This approach draws inspiration from biological systems where neuronal activation is sparse. Computationally, the primary benefit is the decoupling of model capacity from per-input computational cost. We can significantly increase the total number of parameters in the model (adding more experts) without proportionally increasing the FLOPs required to process a single token.
Consider a simplified comparison:
A conceptual comparison of data flow in a dense layer versus a conditional computation setup. In the conditional path, the router selectively activates only certain experts (here, Expert 1 and Expert K) for a given input.
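To make this data-flow difference concrete, here is a minimal NumPy sketch contrasting the two paths for a single token. The dimensions, the expert count, and the simple score-then-select router below are illustrative assumptions, not a prescription for a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64       # toy dimensions
num_experts, top_k = 4, 2        # illustrative conditional setup

x = rng.normal(size=d_model)     # one input token representation

# Dense path: a single large FFN, every parameter touches every token.
W1 = rng.normal(size=(d_hidden * num_experts, d_model))
W2 = rng.normal(size=(d_model, d_hidden * num_experts))
dense_out = W2 @ np.maximum(W1 @ x, 0.0)

# Conditional path: several smaller expert FFNs, but only top_k of them run.
experts = [
    (rng.normal(size=(d_hidden, d_model)), rng.normal(size=(d_model, d_hidden)))
    for _ in range(num_experts)
]
router_W = rng.normal(size=(num_experts, d_model))
scores = router_W @ x                      # one routing score per expert
active = np.argsort(scores)[-top_k:]       # indices of the selected experts

cond_out = np.zeros(d_model)
for i in active:
    w1, w2 = experts[i]
    cond_out += w2 @ np.maximum(w1 @ x, 0.0)   # only selected experts compute

print("experts activated for this token:", sorted(active.tolist()))
```

Both paths hold a comparable number of parameters in total, but the conditional path only ever evaluates the experts the router selects; the others remain inactive for this token.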
The primary advantage stems from this separation of concerns: the total parameter count grows with the number of experts, while the per-token computational cost depends only on the small number of experts activated for each input.
This theoretical separation is visualized below, comparing how parameter count and computational cost might scale.
Comparison of how parameter count and per-token computational cost scale in dense versus conditional computation models (assuming a fixed number of activated experts k in the conditional case). Note the logarithmic scale on the Y-axis. Conditional computation allows parameter count to grow without a proportional increase in per-token compute.
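A back-of-the-envelope calculation makes the same trend explicit. The sketch below uses illustrative layer sizes and ignores the (small) cost of the router; the specific values of `d_model`, `d_ff`, and `k` are assumptions chosen only to show how parameters and per-token FLOPs diverge.

```python
# Rough parameter and per-token FLOP counts for a dense FFN versus an MoE
# layer built from same-sized expert FFNs. All sizes are illustrative.
d_model, d_ff = 1024, 4096   # layer width and hidden width of each FFN
k = 2                        # experts activated per token in the MoE case

def ffn_params(d_model, d_ff):
    # Two weight matrices: d_model -> d_ff and d_ff -> d_model (biases ignored).
    return 2 * d_model * d_ff

def ffn_flops(d_model, d_ff):
    # ~2 FLOPs (multiply + add) per weight for a matrix-vector product.
    return 2 * ffn_params(d_model, d_ff)

dense = ffn_params(d_model, d_ff)
print(f"dense FFN: {dense:,} params, {ffn_flops(d_model, d_ff):,} FLOPs/token")

for num_experts in (8, 64, 256):
    total_params = num_experts * ffn_params(d_model, d_ff)   # grows with N
    flops_per_token = k * ffn_flops(d_model, d_ff)           # fixed by k
    print(f"MoE with {num_experts:3d} experts: "
          f"{total_params:,} params, {flops_per_token:,} FLOPs/token")
```

Increasing the expert count multiplies the parameter total, while the per-token compute stays pinned to the k activated experts, which is exactly the divergence the figure illustrates.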
Mixture of Experts (MoE) is a direct and effective implementation of the conditional computation principle within deep learning architectures, particularly Transformers. In an MoE layer, the 'experts' are typically feed-forward networks (FFNs), and the routing mechanism is a small trainable neural network called the 'gating network' or 'router'.
The router, $G(x)$, examines the input representation $x$ (usually the output of a self-attention layer in a Transformer) and produces probabilities or weights indicating which experts should process this input. The outputs of the selected experts, $E_i(x)$, are then combined, often weighted by the router's scores, as introduced in the chapter overview:
$$ y = \sum_{i=1}^{N} G(x)_i \, E_i(x) $$
Here, $G(x)_i$ embodies the conditional aspect. Ideally, for a given $x$, most $G(x)_i$ values are zero (or near zero), so that only a sparse subset of experts $E_i$ contributes significantly to the output $y$.
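The following sketch implements this equation directly, with a softmax router restricted to the top-$k$ experts so that the remaining gate values $G(x)_i$ are exactly zero. The helper names (`moe_forward`, `router_W`) and the two-matrix expert structure are hypothetical choices for illustration, not a specific library API.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, router_W, experts, top_k=2):
    """Sparse MoE layer: y = sum_i G(x)_i * E_i(x), with G(x)_i zero outside the top-k."""
    logits = router_W @ x                     # one routing logit per expert
    top = np.argsort(logits)[-top_k:]         # indices of the k highest-scoring experts
    gates = np.zeros(len(experts))
    gates[top] = softmax(logits[top])         # renormalize over the selected experts only

    y = np.zeros_like(x)
    for i in top:                             # unselected experts are never evaluated
        w1, w2 = experts[i]                   # each expert is a small two-layer FFN
        y += gates[i] * (w2 @ np.maximum(w1 @ x, 0.0))
    return y, gates

# Tiny usage example with random weights (sizes are illustrative).
rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 8, 32, 4
x = rng.normal(size=d_model)
router_W = rng.normal(size=(num_experts, d_model))
experts = [(rng.normal(size=(d_hidden, d_model)),
            rng.normal(size=(d_model, d_hidden))) for _ in range(num_experts)]

y, gates = moe_forward(x, router_W, experts, top_k=2)
print("gate values G(x):", np.round(gates, 3))
```

Printing the gate vector shows the sparsity directly: all but two entries are exactly zero, and only those two experts contribute to $y$.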
While elegant in principle, realizing the benefits of conditional computation through MoE introduces practical challenges related to routing decisions, load balancing across experts, and efficient implementation in distributed settings. These complexities are the focus of subsequent chapters. Understanding the core principle of conditional computation, however, provides the necessary foundation for appreciating the design choices and optimization strategies employed in modern MoE models.