Routing tokens to multiple experts (k>1) is a common method to combine specialized knowledge, yet it introduces significant computational and communication overhead. This approach requires each token to be processed by multiple experts, with their outputs then weighted and combined. The Switch Transformer architecture, proposed by Fedus, Zoph, and Shazeer, presents a radical simplification to this process: what if each token is routed to only one expert?
This top-1 routing (k=1) dramatically simplifies the MoE layer. Instead of computing a weighted sum of outputs from several experts, the layer's output for a given token is simply the output of the single, best-suited expert, scaled by the router's gating score.
The core of the Switch Transformer is its routing mechanism. For an input token representation $x$, the gating network computes a probability distribution $p(x)$ over all $N$ experts. However, instead of selecting the top-k experts, it only selects the single expert with the highest score:

$$y = p_i(x) \, E_i(x), \quad \text{where } i = \underset{j}{\operatorname{argmax}} \; p_j(x)$$

Here, $p(x)$ is the vector of gating scores, $E_i$ is the $i$-th expert network, and $y$ is the layer's output for the token. This design choice has immediate benefits: each token requires only a single expert's forward pass, less data must be communicated between devices when dispatching tokens, and the routing implementation itself is simpler.
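To make this concrete, here is a minimal sketch of a top-1 layer in PyTorch. The class name `SwitchLayer`, the two-layer MLP experts, and the explicit loop over experts are illustrative assumptions, not the paper's implementation; a production layer would dispatch tokens with batched scatter/gather operations across devices.

```python
import torch
import torch.nn as nn

class SwitchLayer(nn.Module):
    """Minimal top-1 (k=1) MoE layer: each token is processed by a single expert."""

    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # produces the gating logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)   # p(x): (num_tokens, num_experts)
        gate, expert_idx = probs.max(dim=-1)            # top-1 score and expert index
        y = torch.zeros_like(x)
        # Loop over experts for clarity; real implementations batch this dispatch.
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Output is the selected expert's output scaled by the gating score.
                y[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return y
```

Each token's output depends on only one expert's weights, which is what keeps the per-token computation nearly constant as experts are added.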
The diagram below illustrates the difference between a standard MoE with k=2 and a Switch layer with k=1. In the standard model, each token is processed by two experts. In the Switch model, each token is dispatched to exactly one.
Comparison of routing strategies for two tokens. The standard MoE sends each token to its top two experts, whereas the Switch Transformer sends each token to only its top expert.
A top-1 routing strategy might appear to worsen the problem of load imbalance. If the router has a strong preference, it could consistently select the same expert, leaving others unused. To counteract this, Switch Transformers use a refined version of the auxiliary load-balancing loss.
The goal remains the same: to encourage the router to distribute tokens uniformly across all $N$ available experts. The loss is calculated over a batch of $T$ tokens and $N$ experts. It is the scaled dot product of two vectors: $f$, the fraction of tokens actually dispatched to each expert, and $P$, the fraction of router probability assigned to each expert.
The auxiliary loss, $\mathcal{L}_{\text{aux}}$, is defined as:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$
Let's break down the components:

- $f_i$ is the fraction of tokens in the batch that are actually dispatched to expert $i$ (based on the argmax routing decision).
- $P_i$ is the average router probability assigned to expert $i$ over the tokens in the batch.
- $\alpha$ is a scaling coefficient for the auxiliary loss; the authors found a small value such as 0.01 to be effective.

Because $f$ comes from a non-differentiable argmax, gradients flow only through $P$. The product $f_i \cdot P_i$ penalizes experts that receive both a large share of tokens and a large share of router probability, and it is smallest when both are spread uniformly (each $f_i = P_i = 1/N$). This incentivizes the router to spread its probability mass more evenly, which in turn leads to a more balanced load.
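A minimal sketch of this loss, assuming the router's softmax probabilities for a batch of tokens are already available, might look as follows; the function name `load_balancing_loss` and the default `alpha=0.01` follow the description above rather than any particular library's API.

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: (num_tokens, num_experts) softmax output of the router.
    """
    num_experts = router_probs.shape[-1]
    expert_idx = router_probs.argmax(dim=-1)  # top-1 dispatch decision per token
    # f_i: fraction of tokens routed to expert i (non-differentiable).
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    # P_i: mean router probability assigned to expert i (differentiable).
    P = router_probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * P)
```

Since gradients only flow through $P$, the loss nudges the router to lower the probability of experts that are already receiving a large fraction of tokens.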
The simplification of k=1 routing introduces a new implementation detail: expert capacity. In a distributed setup, each expert (residing on a specific device) is allocated a static buffer to handle a certain number of tokens per batch. The size of this buffer is determined by the capacity_factor (C).
An ideal, perfectly balanced router would send exactly $T/N$ tokens to each expert, where $T$ is the number of tokens in the batch and $N$ is the number of experts. The capacity factor provides a buffer for statistical variance:

$$\text{expert capacity} = \frac{T}{N} \times C$$

For example, a capacity_factor of 1.25 means each expert can handle 25% more tokens than the perfectly balanced average.
What happens if an expert's capacity is exceeded? The Switch architecture makes another simplifying choice: the token is dropped. It is not processed by any expert. Instead, its representation from the residual connection is passed directly to the next layer. This is equivalent to the token passing through an identity function within the MoE layer. While dropping tokens seems detrimental, experiments show that with a well-tuned capacity factor and auxiliary loss, the percentage of dropped tokens is often low (<1%) and does not significantly harm overall model performance.
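The sketch below shows how capacity-based dropping might be implemented under these assumptions; the helper name `capacity_mask` and the cumulative-sum trick for counting each token's position in its expert's queue are illustrative choices, not the reference implementation.

```python
import torch

def capacity_mask(expert_idx: torch.Tensor, num_experts: int, capacity_factor: float = 1.25):
    """Return a boolean mask of tokens that fit within their expert's capacity.

    Tokens beyond an expert's capacity are dropped: the expert branch contributes
    nothing for them, and the residual connection carries them to the next layer.
    """
    num_tokens = expert_idx.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)

    one_hot = torch.nn.functional.one_hot(expert_idx, num_experts)  # (tokens, experts)
    # 1-based position of each token within its expert's queue, in token order.
    position_in_expert = (one_hot.cumsum(dim=0) * one_hot).sum(dim=-1)
    return position_in_expert <= capacity, capacity
```

For example, with 1024 tokens, 8 experts, and a capacity_factor of 1.25, each expert accepts at most int(1.25 * 1024 / 8) = 160 tokens; any token routed to an already-full expert receives a False mask entry and skips the expert computation.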
The design of Switch Transformers yields a compelling trade-off between model size and computational cost. By using a large number of experts, the total parameter count of the model can be massive, but because only one expert is activated per token, the training and inference FLOPs remain comparable to a much smaller dense model.
Illustrative comparison of model scale and computation. Both MoE models have 8x the parameters of the dense model. The standard MoE (k=2) more than doubles the FLOPs, while the Switch Transformer (k=1) adds only a small computational overhead.
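A rough back-of-the-envelope calculation makes this trade-off explicit for a single FFN block; the sizes below (d_model=1024, d_ff=4096, 8 experts) are arbitrary illustrative values, and compute is counted as multiply-accumulates of the weight matrices only.

```python
d_model, d_ff, num_experts = 1024, 4096, 8

# Parameters in one dense FFN block (two weight matrices, biases ignored).
dense_params = 2 * d_model * d_ff

# A Switch layer stores every expert's parameters plus a small router matrix.
switch_params = num_experts * dense_params + d_model * num_experts

# Per-token compute: only ONE expert runs per token, plus the router projection.
dense_flops_per_token = 2 * d_model * d_ff
switch_flops_per_token = dense_flops_per_token + d_model * num_experts

print(f"Params:    dense {dense_params:,} vs switch {switch_params:,}")
print(f"FLOPs/tok: dense {dense_flops_per_token:,} vs switch {switch_flops_per_token:,}")
```

The expert pool multiplies the parameter count by the number of experts, while the per-token compute grows only by the tiny router projection.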
The authors of the Switch Transformer paper also noted that training stability can be a challenge. They found that selectively using higher precision (float32) for the router's softmax computation, while keeping the rest of the model in a lower-precision format like BFloat16, was important to prevent instabilities caused by large logit values.
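One common way to apply this idea is to upcast just the router's softmax to float32 while the rest of the model stays in bfloat16, as in the sketch below; the function name `route` and its arguments are assumptions for illustration.

```python
import torch

def route(router: torch.nn.Linear, x: torch.Tensor) -> torch.Tensor:
    """Compute gating probabilities with the softmax in float32 for stability."""
    logits = router(x)                              # may be bfloat16
    probs = torch.softmax(logits.float(), dim=-1)   # upcast before the softmax
    return probs.to(x.dtype)                        # cast back for dispatch
```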
In summary, the Switch Transformer presents an efficient and scalable architecture. Its primary trade-offs are:

- A massive total parameter count (and the memory to hold it) in exchange for per-token FLOPs close to those of a much smaller dense model.
- Reliance on an auxiliary loss to keep top-1 routing balanced across experts.
- The possibility of dropped tokens when an expert's capacity is exceeded.
- Extra care required for training stability, such as selective precision in the router's softmax.