Having established the conceptual foundation of Mixture of Experts (MoE) layers and their contrast with dense architectures, we now formalize the computations involved in a basic MoE layer. This mathematical description clarifies precisely how input data flows through the system, how routing decisions are made, and how expert outputs are combined. Understanding this process is fundamental before exploring more complex architectural variants and training optimizations.
Consider an MoE layer integrated within a larger network, such as a Transformer block. This layer receives an input representation for each token in a sequence. Let $x \in \mathbb{R}^d$ represent the input embedding for a single token, where $d$ is the model's hidden dimension. The MoE layer consists of $N$ expert networks, denoted $E_1, E_2, \ldots, E_N$, and a gating network $G$.
The gating network's primary function is to determine which expert(s) should process the input token $x$. A common implementation uses a simple linear transformation followed by a softmax function.
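Concretely, writing the gating weights as a trainable matrix $W_g \in \mathbb{R}^{N \times d}$ (the symbol $W_g$ is introduced here for illustration; the notation is not fixed by the text), the gating scores can be expressed as:

$$
G(x) = \operatorname{softmax}(W_g x),
\qquad
G(x)_i = \frac{\exp\big((W_g x)_i\big)}{\sum_{j=1}^{N} \exp\big((W_g x)_j\big)}
$$

Each score $G(x)_i$ is non-negative, and the scores sum to one across the $N$ experts.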
Data flow through the gating network, transforming the input token representation $x$ into expert assignment scores $G(x)$.
Each expert $E_i$ is typically a neural network module, often a standard Feed-Forward Network (FFN) similar to those found in Transformer blocks. Critically, each expert $E_i$ possesses its own distinct set of parameters. For an input token $x$, the output of the $i$-th expert is denoted $E_i(x)$. While structurally similar, this parameter differentiation allows each expert to potentially specialize in processing different types of input patterns or performing different sub-tasks.
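As a concrete sketch, a two-layer FFN expert with hidden width $d_{ff}$, activation $\sigma$ (e.g. ReLU or GELU), and its own weights might compute the following (the symbols $W_1^{(i)}$, $W_2^{(i)}$ and the specific width are illustrative assumptions, not fixed by the formulation):

$$
E_i(x) = W_2^{(i)} \, \sigma\!\big(W_1^{(i)} x + b_1^{(i)}\big) + b_2^{(i)},
\qquad
W_1^{(i)} \in \mathbb{R}^{d_{ff} \times d},\;
W_2^{(i)} \in \mathbb{R}^{d \times d_{ff}}
$$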
While the gating network produces a score $G(x)_i$ for every expert, sparse MoE models leverage conditional computation by activating only a subset of these experts for each token. This is commonly achieved using Top-$k$ routing, where only the $k$ experts with the highest gating scores are selected (typically $k=1$ or $k=2$).
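Letting $\mathcal{T}$ denote the set of indices of the $k$ highest-scoring experts for token $x$, the layer output is the gating-weighted sum of the selected experts' outputs (some implementations additionally renormalize the selected scores so they sum to one):

$$
y = \sum_{i \in \mathcal{T}} G(x)_i \, E_i(x),
\qquad
\mathcal{T} = \operatorname{TopK}\big(G(x),\, k\big)
$$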
This formulation ensures that only a small fraction ($k/N$) of the experts' parameters are engaged for any given input token, leading to significant computational savings during the forward pass compared to activating all $N$ experts or using a single large expert.
Forward pass for a single token $x$ in an MoE layer with $k=2$ routing. The gating network produces scores, Top-$k$ selection picks the experts ($E_1$ and $E_N$ shown here), the input is processed only by the selected experts, and their outputs are combined using the gating scores.
In practice, MoE layers operate on batches of tokens. For a batch of $T$ tokens, the input is typically represented as a matrix $X \in \mathbb{R}^{T \times d}$. The gating network computes scores $G(X) \in \mathbb{R}^{T \times N}$ for all tokens simultaneously. Crucially, the Top-$k$ selection is performed independently for each token (row) in the batch. This means different tokens within the same batch can be routed to different sets of experts.
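The following PyTorch sketch illustrates this batched flow under simplifying assumptions: every expert is applied to every token and unselected outputs are zeroed out by the gating weights (production systems dispatch only the routed tokens to each expert, typically with capacity limits), and the selected scores are renormalized to sum to one. The class name SimpleMoELayer and all hyperparameters are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Minimal sketch of a sparse MoE layer with per-token Top-k routing.

    For clarity, every expert processes every token and unselected outputs
    are zeroed by the gating weights; real implementations dispatch only
    the selected tokens to each expert for efficiency.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # produces gating logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model) -- a batch of T token representations
        scores = F.softmax(self.gate(x), dim=-1)               # (T, N) gating scores G(X)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)    # (T, k) per-token Top-k selection
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # renormalize over selected experts

        # Build a sparse (T, N) weight matrix: zeros except at the selected experts.
        weights = torch.zeros_like(scores).scatter_(-1, topk_idx, topk_scores)

        # Combine expert outputs, weighted by the gating scores.
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)  # (T, N, d_model)
        return torch.einsum("tn,tnd->td", weights, expert_outputs)

# Example usage: 8 experts, Top-2 routing, batch of 16 tokens.
layer = SimpleMoELayer(d_model=64, d_ff=256, num_experts=8, k=2)
tokens = torch.randn(16, 64)
output = layer(tokens)  # (16, 64); different tokens may be routed to different experts
```

Because the Top-$k$ selection happens row by row, inspecting `topk_idx` for this batch would show each token paired with its own set of expert indices, which is exactly the per-token independence described above.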
While efficient, this independent routing per token introduces challenges, particularly regarding load balancing: ensuring that experts receive a roughly equal amount of computation across the batch. If the gating network consistently routes most tokens to a small subset of experts, others remain underutilized, diminishing the benefits of specialization and potentially destabilizing training. We will address these training dynamics and associated optimization techniques in Chapter 3.
This mathematical framework provides the essential building blocks for understanding how basic MoE layers perform conditional computation. The interplay between the gating network, sparse routing mechanism, and expert networks forms the core of the MoE paradigm.