With the gating and expert networks introduced at a high level, we can now assemble them into a precise mathematical model. This section details the complete forward pass of a single token through a sparse MoE layer, outlining the concrete computational steps.
The process begins with the gating network, also known as the router. Its job is to determine which experts should process the current input token. The input is a token embedding, represented as a vector $x \in \mathbb{R}^d$, where $d$ is the model's hidden dimension.
The gating network itself is a simple linear layer, defined by a weight matrix $W_g \in \mathbb{R}^{d \times N}$, where $N$ is the total number of experts. This layer projects the input token into a space of dimension $N$, producing a logit for each expert:
$$h(x) = x \cdot W_g$$
The resulting vector $h(x)$ contains $N$ raw scores. To convert these scores into a probability distribution, we apply the softmax function:
$$g(x) = \mathrm{softmax}(h(x))$$
The output, $g(x)$, is a dense $N$-dimensional vector where each element $g(x)_i$ represents the router's confidence in assigning the token to expert $i$. The sum of all elements in $g(x)$ is 1.
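A minimal sketch of this computation in PyTorch (the dimensions $d = 512$ and $N = 64$ are illustrative choices, not prescribed by the text):

```python
import torch

d, N = 512, 64                    # hidden dimension and number of experts (illustrative)
W_g = torch.randn(d, N) * 0.02    # gating weight matrix W_g in R^{d x N}

x = torch.randn(d)                # a single token embedding

h = x @ W_g                       # h(x): one raw logit per expert
g = torch.softmax(h, dim=-1)      # g(x): dense probability distribution over experts

assert torch.isclose(g.sum(), torch.tensor(1.0))
```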
A dense $g(x)$ vector implies that every expert would contribute to the output, which defeats the computational efficiency goal of MoEs. To enforce sparsity, we employ a TopK operation: instead of using all experts, we select a small, fixed number, $k$, of the highest-scoring experts.
For a given token, we identify the indices of the top $k$ values in $g(x)$ and set all other gating values to zero. This creates a sparse gating vector, $G(x)$. The choice of $k$ is a critical hyperparameter. In Switch Transformers, $k = 1$, meaning each token is routed to a single expert. A more common choice is $k = 2$, which provides a path for learning more complex functions and adds a degree of redundancy.
This operation effectively prunes the computational graph for each token. If $k = 2$ and we have $N = 64$ experts, we only need to perform the forward pass for 2 of them, ignoring the other 62.
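Continuing the sketch above, the TopK selection amounts to keeping the $k$ largest entries of $g(x)$ and zeroing the rest:

```python
k = 2

topk_vals, topk_idx = torch.topk(g, k)   # values and indices of the k best experts

G = torch.zeros_like(g)                  # sparse gating vector G(x)
G[topk_idx] = topk_vals                  # only k of the N entries are non-zero
```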
Each of the $N$ experts is typically an independent feed-forward network (FFN). While they all share the same architecture, they do not share weights. Each expert $E_i$ has its own set of parameters. A standard two-layer FFN expert can be written as:
$$E_i(x) = \mathrm{ReLU}(x \cdot W_{1,i}) \cdot W_{2,i}$$
Here, $W_{1,i}$ and $W_{2,i}$ are the weight matrices for the first and second linear layers of expert $i$, respectively. It is this collection of independent expert weights that leads to the dramatic increase in the model's total parameter count.
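As a sketch, one expert can be written as a small PyTorch module. The inner dimension `d_ff` is an assumption here; a value of roughly $4d$ is common in Transformer FFNs:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """One feed-forward expert: two linear layers with a ReLU in between."""

    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d, d_ff, bias=False)   # W_{1,i}
        self.w2 = nn.Linear(d_ff, d, bias=False)   # W_{2,i}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))
```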
We can now combine these steps to define the final output, $y(x)$, of the MoE layer: the weighted sum of the outputs of the selected experts, using the sparse gating weights from the TopK operation:
$$y(x) = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$
Since $G(x)$ is sparse with only $k$ non-zero values, this summation is computationally efficient. We only need to evaluate $E_i(x)$ for the $k$ experts that were selected by the router.
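Putting the pieces together, here is a sketch of the full layer applied to a single token (batching, expert capacity limits, and load balancing are omitted; `Expert` is the class defined above):

```python
class MoELayer(nn.Module):
    """Sketch of a sparse MoE layer applied to a single token vector."""

    def __init__(self, d: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d, n_experts, bias=False)   # W_g
        self.experts = nn.ModuleList(Expert(d, d_ff) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)        # dense router probabilities g(x)
        topk_vals, topk_idx = torch.topk(g, self.k)    # non-zero entries of G(x)
        y = torch.zeros_like(x)
        # Only the k selected experts run; the other N - k are skipped entirely.
        for w, i in zip(topk_vals, topk_idx.tolist()):
            y = y + w * self.experts[i](x)
        return y

layer = MoELayer(d=512, d_ff=2048, n_experts=64, k=2)
y = layer(torch.randn(512))        # same shape as the input: torch.Size([512])
```

Note that `topk_vals` here do not necessarily sum to 1; a common refinement that addresses this is discussed below.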
The entire data flow for a single token can be visualized as follows:
Data flow for a single token through an MoE layer. The input $x$ is sent to the gating network to produce sparse weights and, in parallel, to the selected experts for processing.
The TopK function is non-differentiable, which poses a problem for backpropagation. In practice, this is handled by a straight-through estimator: during the forward pass, we apply the discrete TopK selection; during the backward pass, we pass the gradients through the top $k$ gates as if the selection had been a simple multiplication. The dense gating output $g(x)$ is used to compute the gradients for the gating weights $W_g$.
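A quick way to see this behavior in PyTorch: `torch.topk` treats the selection indices as constants but keeps the selected values in the autograd graph, so the router weights receive a training signal through the $k$ chosen gates. A minimal sketch, reusing `x`, `d`, `N`, and `k` from above:

```python
W_g = torch.randn(d, N, requires_grad=True)

g = torch.softmax(x @ W_g, dim=-1)
topk_vals, _ = torch.topk(g, k)       # indices are constants; the selected
loss = topk_vals.sum()                # values remain differentiable
loss.backward()

print(W_g.grad.abs().sum() > 0)       # tensor(True): gradients reach the router
```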
Additionally, after selecting the top $k$ values from the initial softmax output $g(x)$, their sum is no longer guaranteed to be 1. To form a proper convex combination, these $k$ values are often re-normalized. This is typically done by applying a second softmax only to the top $k$ logits selected from $h(x)$. This ensures the weights used in the final summation accurately reflect their relative importance and sum to 1.
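In code, this renormalization is a softmax over the $k$ selected logits rather than over all $N$ (a sketch continuing the example above; since softmax is monotonic, taking the top $k$ of the logits selects the same experts as taking the top $k$ of $g(x)$):

```python
h = x @ W_g                                   # raw logits h(x)
topk_logits, topk_idx = torch.topk(h, k)      # pick the k best experts by logit
weights = torch.softmax(topk_logits, dim=-1)  # re-normalized gates over k experts

assert torch.isclose(weights.sum(), torch.tensor(1.0))
```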
This formulation provides a model with a massive number of parameters but a constant computational cost per token, determined by $k$ rather than $N$. However, this elegant structure introduces a significant challenge: if the gating network learns to route most tokens to only a few experts, the other experts will not receive training signals. This leads to the problem of expert collapse, which we address next by introducing load balancing losses.