The standard mechanism for routing tokens in a sparse Mixture of Experts model is Top-k gating. Its design is straightforward and effective, forming the basis for many successful MoE architectures. The gating network, a simple linear layer, produces a logit for each expert, representing a preference for that expert to process the current input token. The TopK function then selects the $k$ experts with the highest logit scores.

## The Mechanics of Top-k Gating

For a given input token representation $x$, the process begins with the gating network, parameterized by a weight matrix $W_g$, which computes the logits $h(x)$:

$$ h(x) = W_g \cdot x $$

Here, $h(x)$ is a vector where each element $h(x)_i$ corresponds to the logit for expert $i$. The TopK operation selects the indices of the $k$ highest values in $h(x)$. Let's call this set of indices $S$.

The final gating scores, $G(x)$, are typically calculated by applying a softmax function only to the selected logits. This ensures the weights for the chosen experts sum to one.

$$ G(x)_i = \begin{cases} \dfrac{e^{h(x)_i}}{\sum_{j \in S} e^{h(x)_j}} & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases} $$

The output of the MoE layer, $y$, is then the weighted sum of the outputs from the selected experts:

$$ y = \sum_{i \in S} G(x)_i \cdot E_i(x) $$

The hyperparameter $k$ is a significant design choice. Using $k=2$ allows each token to be processed by two experts, providing a richer representation than routing to a single expert. This can improve model quality at the cost of increased computation.
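The mechanics above can be sketched in a few lines of NumPy. This is a minimal single-token illustration, not an implementation from any particular library; the names `top_k_gating` and `expert_fns` (a list of callables, one per expert) are assumptions for the example.

```python
import numpy as np

def top_k_gating(x, W_g, expert_fns, k=2):
    """Route one token x through the top-k experts of an MoE layer.

    This is an illustrative sketch: W_g is the gating weight matrix,
    expert_fns is a list of callables, one per expert.
    """
    h = W_g @ x                        # gating logits h(x), one per expert
    S = np.argsort(h)[-k:]             # indices of the k highest logits
    # Softmax over the selected logits only, so the k weights sum to 1.
    e = np.exp(h[S] - h[S].max())      # shift by the max for numerical stability
    g = e / e.sum()
    # Output y is the weighted sum of the selected experts' outputs.
    return sum(w * expert_fns[i](x) for w, i in zip(g, S))
```

Note that the softmax is applied after the TopK selection, matching the piecewise definition of $G(x)$ above: unselected experts receive a weight of exactly zero and are never evaluated.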
In contrast, setting $k=1$, as popularized by Switch Transformers, minimizes computational and communication overhead.

```dot
digraph G {
    rankdir=TB;
    node [shape=record, style="rounded,filled", fillcolor="#e9ecef", fontname="sans-serif"];
    edge [fontname="sans-serif"];
    subgraph cluster_input {
        label = "Input Token";
        style = "filled";
        color = "#f8f9fa";
        token [label="Token x", shape=box, style="filled", fillcolor="#a5d8ff"];
    }
    subgraph cluster_gating {
        label = "Gating Network (Top-k=2)";
        style = "filled";
        color = "#f8f9fa";
        gate [label="{ <f0> Gating Logits h(x) | { E1: 0.8 | E2: 1.5 | E3: -0.2 | E4: 2.1 | ... | E8: 0.5 }}"];
        topk [label="TopK Selection", shape=invhouse, style="filled", fillcolor="#ffc9c9"];
        gate:f0 -> topk;
    }
    subgraph cluster_experts {
        label = "Expert Networks";
        style = "filled";
        color = "#f8f9fa";
        E1 [label="Expert 1", fillcolor="#e9ecef"];
        E2 [label="Expert 2", fillcolor="#b2f2bb"];
        E3 [label="Expert 3", fillcolor="#e9ecef"];
        E4 [label="Expert 4", fillcolor="#b2f2bb"];
        E_dots [label="...", shape=plaintext];
        E8 [label="Expert 8", fillcolor="#e9ecef"];
    }
    output [label="Final Output", shape=box, style="filled", fillcolor="#a5d8ff"];
    token -> gate;
    topk -> E2 [label=" G(x)₂"];
    topk -> E4 [label=" G(x)₄"];
    {E2, E4} -> output [label="Weighted Sum"];
}
```

*The gating process for a single token. The gating network computes logits, and the TopK function selects the two experts with the highest scores (Expert 2 and Expert 4) to process the token.*

## Load Imbalance and its Consequences

While functionally simple, Top-k gating has a significant operational weakness: it can lead to severe load imbalance. During training, the gating network may learn to favor a small subset of "popular" experts, while others remain underutilized or are rarely selected.
This phenomenon arises because the gating network is a learned function, and without constraints, it will optimize for predictive accuracy, which may involve repeatedly using the same few experts that prove most effective early in training.

This imbalance introduces two major problems:

- **Training Inefficiency:** Under-selected experts receive few training signals, causing their parameters to be poorly optimized. The model effectively fails to use its full capacity, as a large fraction of its weights contributes very little to its performance.
- **Computational Waste:** On a hardware level, experts are typically distributed across multiple devices (e.g., GPUs). If one expert receives a disproportionately high number of tokens, its assigned device becomes a bottleneck, while devices hosting unpopular experts sit idle.

```json
{
  "layout": {
    "title": {"text": "Example of Imbalanced Expert Load"},
    "xaxis": {
      "title": {"text": "Expert ID"},
      "tickmode": "array",
      "tickvals": [0, 1, 2, 3, 4, 5, 6, 7],
      "ticktext": ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8"]
    },
    "yaxis": {"title": {"text": "Tokens Assigned in Batch"}},
    "bargap": 0.2,
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#ffffff"
  },
  "data": [
    {
      "type": "bar",
      "x": [0, 1, 2, 3, 4, 5, 6, 7],
      "y": [120, 550, 80, 115, 490, 95, 75, 105],
      "marker": {
        "color": ["#4dabf7", "#fa5252", "#4dabf7", "#4dabf7", "#fa5252", "#4dabf7", "#4dabf7", "#4dabf7"],
        "line": {"width": 0}
      }
    }
  ]
}
```

*A snapshot of token distribution across eight experts in a single training batch. Experts 2 and 5 are heavily overloaded, while the others are underutilized, indicating poor load balance.*

## Capacity Factor and Dropped Tokens

To manage the flow of tokens and prevent any single expert from being overwhelmed, MoE implementations introduce a capacity factor. This hyperparameter defines a buffer for how many tokens each expert can process in a batch.
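Load imbalance is easy to quantify from a batch's routing decisions. The sketch below, a hypothetical helper rather than any library's API, counts tokens per expert and computes a simple max-to-mean overload ratio, using the same token counts as the example batch above.

```python
import numpy as np

def expert_load(expert_indices, num_experts):
    """Count how many tokens each expert was assigned in a batch.

    expert_indices is the flattened array of chosen expert ids;
    with top-k routing, each token contributes k entries.
    """
    return np.bincount(expert_indices, minlength=num_experts)

# Token counts from the example batch: experts 2 and 5 dominate.
counts = np.array([120, 550, 80, 115, 490, 95, 75, 105])
overload = counts.max() / counts.mean()  # 1.0 would be a perfectly even load
```

A ratio well above 1.0, as here, means the device hosting the busiest expert does several times the average amount of work while the rest wait.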
The capacity $C$ for each expert is defined as:

$$ C = \mathrm{capacity\ factor} \times \frac{\mathrm{total\ tokens\ in\ batch}}{\mathrm{number\ of\ experts}} $$

A `capacity_factor` greater than 1.0 provides a buffer for minor imbalances. For example, a value of 1.25 allows each expert to process up to 25% more tokens than the average.

However, if a popular expert receives more tokens than its capacity $C$, the excess tokens are "dropped." These dropped tokens are not processed by any expert. Instead, they bypass the MoE layer entirely and are passed through the residual connection. This is a form of information loss and can degrade model performance, as the model is unable to apply specialized computation to those tokens. Choosing the right `capacity_factor` is a trade-off: a higher value reduces the number of dropped tokens, but it increases memory allocation and leads to more wasted computation when the load is already balanced, since the extra buffer slots sit unused.

The primary mechanism for combating load imbalance in a standard Top-k system is the auxiliary load balancing loss, which we introduced in Chapter 1. This loss term penalizes imbalanced routing decisions, encouraging the gating network to spread tokens more evenly across all available experts. It is an essential component for stabilizing training and making Top-k gating viable in practice. However, as we will see in the following sections, modifying the routing algorithm itself can provide more direct and powerful solutions to this challenge.
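The capacity computation and token dropping described above can be sketched as follows. This is a simplified illustration, assuming $k=1$ routing and first-come-first-served dispatch; the function name `dispatch_with_capacity` is hypothetical, and real systems perform this step with batched tensor operations rather than a Python loop.

```python
import numpy as np

def dispatch_with_capacity(expert_indices, num_experts, capacity_factor=1.25):
    """Assign tokens to experts in arrival order, dropping overflow tokens.

    expert_indices holds the chosen expert id for each token (k=1 routing).
    Dropped tokens would bypass the MoE layer via the residual connection.
    """
    num_tokens = len(expert_indices)
    # C = capacity_factor * (total tokens in batch) / (number of experts)
    capacity = int(capacity_factor * num_tokens / num_experts)
    fill = np.zeros(num_experts, dtype=int)  # tokens accepted so far per expert
    kept, dropped = [], []
    for t, e in enumerate(expert_indices):
        if fill[e] < capacity:
            fill[e] += 1
            kept.append(t)
        else:
            dropped.append(t)  # expert's buffer is full: token is dropped
    return kept, dropped, capacity
```

With eight tokens, four experts, and `capacity_factor=1.0`, each expert can accept two tokens; a third token routed to the same expert overflows the buffer and is dropped, exactly the failure mode a higher `capacity_factor` or the auxiliary loss is meant to mitigate.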