The standard mechanism for routing tokens in a sparse Mixture of Experts model is Top-k Gating. Its design is straightforward and effective, forming the basis for many successful MoE architectures. The gating network, a simple linear layer, produces a logit for each expert, representing a preference for that expert to process the current input token. The TopK function then selects the k experts with the highest logit scores.
For a given input token representation $x$, the process begins with the gating network, parameterized by a weight matrix $W_g$, which computes the logits $h(x)$:

$$h(x) = x \cdot W_g$$

Here, $h(x)$ is a vector where each element $h_i(x)$ corresponds to the logit for expert $i$. The TopK operation selects the indices of the $k$ highest values in $h(x)$. Let's call this set of indices $\mathcal{T}$.

The final gating scores, $g_i(x)$, are typically calculated by applying a softmax function only to the selected logits. This ensures the weights for the chosen experts sum to one:

$$g_i(x) = \begin{cases} \dfrac{\exp(h_i(x))}{\sum_{j \in \mathcal{T}} \exp(h_j(x))} & \text{if } i \in \mathcal{T} \\ 0 & \text{otherwise} \end{cases}$$

The output of the MoE layer, $y$, is then the weighted sum of the outputs $E_i(x)$ from the selected experts:

$$y = \sum_{i \in \mathcal{T}} g_i(x)\, E_i(x)$$
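To make these steps concrete, here is a minimal sketch of the gating computation in PyTorch (an assumed framework; names such as TopKGating and moe_output are illustrative, not taken from any particular library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Minimal top-k gating: logits -> top-k selection -> renormalized scores."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # The gating network is a single linear layer producing one logit per expert.
        self.w_g = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        logits = self.w_g(x)                               # h(x): [num_tokens, num_experts]
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        # Softmax over the selected logits only, so the k weights sum to one.
        gates = F.softmax(topk_logits, dim=-1)             # g_i(x): [num_tokens, k]
        return logits, gates, topk_idx

def moe_output(x, gates, topk_idx, experts):
    """Combine expert outputs: y = sum over selected experts of g_i(x) * E_i(x)."""
    out = torch.zeros_like(x)
    for slot in range(gates.shape[-1]):                    # one of the k routing slots
        idx = topk_idx[:, slot]
        for e, expert in enumerate(experts):
            mask = idx == e                                # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```

The explicit loop over experts keeps the example readable; efficient implementations instead dispatch tokens with batched gather and scatter operations.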
The hyperparameter $k$ is a significant design choice. Using $k=2$ allows each token to be processed by two experts, providing a richer representation than routing to a single expert. This can improve model quality at the cost of increased computation. In contrast, setting $k=1$, as popularized by Switch Transformers, minimizes computational and communication overhead.
The gating process for a single token. The gating network computes logits, and the TopK function selects the two experts with the highest scores (Expert 2 and Expert 4) to process the token.
While functionally simple, Top-k gating has a significant operational weakness: it can lead to severe load imbalance. During training, the gating network may learn to favor a small subset of "popular" experts, while others remain underutilized or are rarely selected. This phenomenon arises because the gating network is a learned function, and without constraints, it will optimize for predictive accuracy, which may involve repeatedly using the same few experts that prove most effective early in training.
This imbalance introduces two major problems. First, it wastes hardware: in an expert-parallel setup, overloaded experts become stragglers that other devices must wait for, while the devices hosting underused experts sit idle. Second, it wastes model capacity: rarely selected experts receive few gradient updates, fail to specialize, and contribute little to the final model.
A snapshot of token distribution across eight experts in a single training batch. Experts 2 and 5 are heavily overloaded, while the others are underutilized, indicating poor load balance.
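To diagnose this during training, it helps to log the fraction of tokens each expert receives per batch. A small sketch, assuming the topk_idx tensor produced by the gating example above:

```python
import torch

def expert_load(topk_idx: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Fraction of routing assignments sent to each expert in a batch."""
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()

# With balanced routing, every entry would be close to 1 / num_experts.
```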
To manage the flow of tokens and prevent any single expert from being overwhelmed, MoE implementations introduce a capacity factor. This hyperparameter defines a buffer for how many tokens each expert can process in a batch. The capacity for each expert is defined as:

$$\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity\_factor}$$
A capacity_factor greater than 1.0 provides a buffer for minor imbalances. For example, a value of 1.25 allows each expert to process up to 25% more tokens than the average.
However, if a popular expert receives more tokens than its capacity, the excess tokens are "dropped." These dropped tokens do not get processed by any expert. Instead, they bypass the MoE layer entirely and are passed through the residual connection. This is a form of information loss and can degrade model performance, as the model is unable to apply specialized computation to those tokens. Finding the right capacity_factor is a trade-off: a higher value reduces the number of dropped tokens, but it increases memory allocation and wastes computation on padding whenever the load is already well balanced.
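A sketch of the capacity computation and a simple first-come, first-served dropping policy is shown below; the exact overflow behavior varies between implementations, and the function names here are illustrative:

```python
import math
import torch

def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    # capacity = (tokens per batch / number of experts) * capacity_factor
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def keep_mask(topk_idx: torch.Tensor, num_experts: int, capacity: int) -> torch.Tensor:
    """Boolean mask over (token, slot) assignments; overflow assignments are dropped.

    Dropped tokens receive no expert output for that slot and rely on the
    residual connection instead.
    """
    flat = topk_idx.flatten()
    keep = torch.zeros_like(flat, dtype=torch.bool)
    counts = torch.zeros(num_experts, dtype=torch.long)
    for i, e in enumerate(flat.tolist()):        # first come, first served within the batch
        if counts[e] < capacity:
            keep[i] = True
            counts[e] += 1
    return keep.view_as(topk_idx)
```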
The primary mechanism for combating load imbalance in a standard Top-k system is the auxiliary load balancing loss, which we introduced in Chapter 1. This loss term penalizes imbalanced routing decisions, encouraging the gating network to spread tokens more evenly across all available experts. It is an essential component for stabilizing training and making Top-k gating viable in practice. However, as we will see in the following sections, modifying the routing algorithm itself can provide more direct and powerful solutions to this challenge.
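For reference, one common formulation of this auxiliary loss, used by Switch Transformers, multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert and scales by the number of experts. The sketch below assumes the logits and topk_idx tensors from the gating example; the coefficient alpha and the handling of k > 1 vary between implementations and may differ from the exact form given in Chapter 1:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(logits: torch.Tensor, topk_idx: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    # logits: [num_tokens, num_experts]; topk_idx: [num_tokens, k]
    probs = F.softmax(logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i (summed over the k slots).
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1)
    f = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # The loss is minimized when both f and p are uniform across experts.
    return alpha * num_experts * torch.sum(f * p)
```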