Replacing feed-forward networks (FFNs) with Mixture-of-Experts (MoE) layers is a general strategy that has led to several influential architectures. Each introduces its own design patterns and refinements, demonstrating different approaches to routing, layer placement, and optimization, with distinct trade-offs. Understanding these landmark models provides a map of the design space for building high-performance sparse networks.
The Sparsely-Gated Mixture-of-Experts layer, introduced in the "Outrageously Large Neural Networks" paper, is the foundational architecture upon which most modern MoEs are built. Its design introduced the two components that are now standard practice:

- A trainable gating network (router) that uses noisy Top-k selection to send each token to a small subset of experts.
- An auxiliary load-balancing loss that encourages tokens to be spread evenly across the experts.
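To make the gating mechanism concrete, here is a minimal PyTorch sketch of a noisy Top-k gate with a simplified load-balancing term. The class name, dimensions, and the exact form of the penalty are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Illustrative noisy Top-k gate in the spirit of the sparsely-gated MoE.

    The simplified load-balancing penalty below is an assumption for clarity,
    not a copy of the paper's exact loss.
    """

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # clean routing logits
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # learned noise scale

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        clean_logits = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_std

        # Keep only the k highest-scoring experts per token; all others get zero weight.
        topk_vals, topk_idx = noisy_logits.topk(self.k, dim=-1)
        gates = torch.zeros_like(noisy_logits)
        gates.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))

        # Simplified load-balancing penalty: squared coefficient of variation of
        # the total routing weight each expert receives over the batch.
        importance = gates.sum(dim=0)
        load_balance_loss = importance.var() / (importance.mean() ** 2 + 1e-9)
        return gates, topk_idx, load_balance_loss
```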
This sparsely-gated architecture established the viability of training models with over a hundred billion parameters while keeping the computational cost per token roughly constant, setting the stage for the variants that followed.
The Generalist Language Model (GLaM) demonstrated the impressive scaling properties of MoEs by training a 1.2 trillion parameter model. Despite this total size, each token activates only around 97 billion parameters (about 8% of the total), so the computation per token is comparable to that of a much smaller dense model, showcasing the parameter-to-FLOPs decoupling discussed earlier.
GLaM’s architectural contribution lies in its layer placement strategy. Instead of replacing every FFN, GLaM substitutes an MoE layer for the FFN in every other Transformer block. This alternating pattern proved to be a highly effective configuration.
Diagram illustrating the alternating placement of dense FFN and MoE FFN layers in the GLaM architecture.
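The placement itself is easy to express in code. The sketch below builds the FFN sublayers for a GLaM-style stack, assuming a hypothetical `moe_ffn_factory` callable that returns an MoE module; the dense blocks use an ordinary two-layer FFN.

```python
import torch.nn as nn

def build_glam_style_ffns(num_layers: int, d_model: int, d_ff: int, moe_ffn_factory):
    """Build the FFN sublayers for a GLaM-style Transformer stack.

    Every other block uses an MoE FFN; the rest keep a standard dense FFN.
    `moe_ffn_factory(d_model, d_ff)` is an assumed callable returning an nn.Module.
    """
    ffns = []
    for layer_idx in range(num_layers):
        if layer_idx % 2 == 1:
            # Odd-indexed blocks: replace the FFN with a Top-2 routed MoE layer.
            ffns.append(moe_ffn_factory(d_model, d_ff))
        else:
            # Even-indexed blocks: ordinary dense FFN.
            ffns.append(nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            ))
    return nn.ModuleList(ffns)
```

Using MoE layers in only half the blocks also halves the number of routing decisions and expert dispatches per forward pass compared to placing them in every block.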
GLaM uses a Top-2 gating mechanism, meaning each token is processed by the two experts with the highest router scores. The results were significant: GLaM substantially outperformed the 175-billion parameter GPT-3 on a range of language tasks while requiring one-third of the energy to train. This highlighted that scaling model size via sparsity could yield better performance per unit of computation.
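In code, Top-2 gating amounts to a weighted sum of two expert outputs per token. The sketch below loops over experts for readability; real implementations use batched dispatch, and the function and argument names are illustrative.

```python
import torch

def top2_combine(x, experts, router_logits):
    """Weighted sum of the two highest-scoring experts for each token.

    x: [num_tokens, d_model]; router_logits: [num_tokens, num_experts];
    experts: list of callables mapping [n, d_model] -> [n, d_model].
    """
    top2_vals, top2_idx = router_logits.topk(2, dim=-1)
    weights = torch.softmax(top2_vals, dim=-1)  # renormalize over the chosen pair
    out = torch.zeros_like(x)
    for slot in range(2):
        for e in range(router_logits.shape[-1]):
            mask = top2_idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out
```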
The Switch Transformer architecture proposed a radical simplification of MoE routing to maximize training and inference efficiency. Instead of routing a token to the top k experts, a Switch Transformer routes each token to only one expert (k = 1). This is also known as Switch routing or Top-1 routing.
The primary motivation for this design is to reduce communication overhead in distributed training setups. In a standard Top-k MoE, each token needs to be dispatched to multiple experts, which may reside on different hardware accelerators. This all-to-all communication pattern can create bottlenecks. By restricting the routing to a single expert, the communication pattern becomes much simpler and faster.
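A minimal sketch of Switch-style routing is shown below. The grouping of tokens by expert index stands in for the all-to-all dispatch a distributed implementation would perform across accelerators, and the names are illustrative.

```python
import torch

def switch_route(x, experts, router_logits):
    """Route each token to exactly one expert (k = 1), Switch-style.

    x: [num_tokens, d_model]; router_logits: [num_tokens, num_experts];
    experts: list of callables mapping [n, d_model] -> [n, d_model].
    """
    probs = torch.softmax(router_logits, dim=-1)
    gate_vals, expert_idx = probs.max(dim=-1)      # one expert per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():
            # Each token is processed by a single expert, scaled by its gate value.
            out[mask] = gate_vals[mask].unsqueeze(-1) * expert(x[mask])
    return out
```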
This simplification results in a notable speedup. The authors of the Switch Transformer paper reported training speeds up to 7x faster than a dense model with an equivalent computational budget on TPUs. The trade-off is a potential reduction in representational capacity, as the model loses the ability to combine outputs from multiple experts for a single token. However, for many large-scale applications, the gains in efficiency outweigh this limitation.
Comparison of token flow in Top-2 routing versus Switch (Top-1) routing. Switch routing simplifies computation by sending each token to a single expert.
The Switch Transformer also highlighted the importance of training stability, particularly with lower-precision formats like bfloat16. The authors introduced techniques such as selective precision (computing the router in float32 while the rest of the model remains in bfloat16) and a reduced weight-initialization scale to ensure stable training at scale.
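Selective precision can be pictured as keeping only the router computation in float32 while the surrounding model stays in bfloat16, as in this illustrative fragment (not the paper's code):

```python
import torch

def stable_router_probs(x_bf16: torch.Tensor, w_gate: torch.Tensor) -> torch.Tensor:
    """Compute router probabilities in float32, then return them in the model's dtype.

    Only this small region runs in float32; the rest of the layer stays in bfloat16.
    """
    logits = x_bf16.float() @ w_gate.float()   # selective float32 computation
    probs = torch.softmax(logits, dim=-1)
    return probs.to(x_bf16.dtype)
```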
These architectural variants represent points in a broad design space. The choice between them depends on the project's goals, such as maximizing model quality, minimizing training time, or simplifying deployment.
| Architecture | Routing Strategy | Innovation | Primary Benefit |
|---|---|---|---|
| Sparsely-Gated MoE | Noisy Top-k | Introduced sparse gating and an auxiliary load-balancing loss. | Established the foundation for large sparse models. |
| GLaM | Top-2 | Scaled MoEs to trillion-parameter scale with a specific layer placement. | High model quality with lower inference cost. |
| Switch Transformer | Top-1 (Switch) | Simplified routing to a single expert per token. | Reduced communication overhead and improved training speed. |
As you design your own MoE-based models, you can draw from these patterns. You might choose the simplicity and speed of Switch routing for a production system, or opt for the higher capacity of GLaM-style Top-2 routing for a research model where model quality is the primary objective. The hands-on exercise at the end of this chapter will give you a chance to implement these ideas directly.