While the principle of substituting feed-forward networks (FFNs) with MoE layers is a general strategy, several influential architectures have introduced specific design patterns and refinements. These variants demonstrate different approaches to routing, layer placement, and optimization, each with its own trade-offs. Understanding these landmark models provides a map of the design space for building high-performance sparse networks.

### The Sparsely-Gated MoE: The Foundational Blueprint

The Sparsely-Gated Mixture-of-Experts layer, introduced in the "Outrageously Large Neural Networks" paper (Shazeer et al., 2017), is the foundational architecture upon which most modern MoEs are built. Its design introduced the two components that are now standard practice:

- **Sparsely-Gated Routing:** A trainable gating network selects the top $k$ experts for each token. This gating mechanism is what makes the model "sparse," because only a fraction of the network's weights are used for any given input. The layer's output is a weighted sum of the outputs from the selected experts.
- **Auxiliary Load Balancing Loss:** A supplemental loss function added to the main training objective. As discussed in Chapter 1, this loss encourages the gating network to distribute tokens evenly across all available experts, preventing a situation where a few experts are consistently chosen while the others remain untrained.

This architecture established the viability of training models with over a hundred billion parameters at a roughly constant computational cost per token, setting the stage for the variants that followed.

### GLaM: Scaling to Trillion-Parameter Models

The Generalist Language Model (GLaM) demonstrated the scaling properties of MoEs by training a 1.2 trillion parameter model. Only a small fraction of those parameters, roughly 8% (about 97 billion), is activated for any given token, showcasing the parameter-to-FLOPs decoupling discussed earlier.

GLaM's architectural contribution lies in its specific and effective layer placement strategy. Instead of replacing every FFN layer, GLaM replaces the FFN with an MoE layer in every other Transformer block. This alternating pattern proved to be a highly effective configuration.

```dot
digraph G {
  rankdir=TB;
  graph [fontname="Arial", bgcolor="transparent"];
  node [shape=box, style="filled,rounded", fontname="Arial", fontsize=10];
  edge [fontname="Arial", fontsize=9];

  subgraph cluster_glam {
    label="GLaM Architecture Pattern";
    style=dashed;
    color="#495057";

    tn1 [label="Transformer Block N", fillcolor="#dee2e6"];
    mha1 [label="Multi-Head Attention", fillcolor="#a5d8ff"];
    ffn1 [label="Dense FFN", fillcolor="#b2f2bb"];
    tn1 -> mha1 [style=invis];
    mha1 -> ffn1 [style=invis];

    tn2 [label="Transformer Block N+1", fillcolor="#dee2e6"];
    mha2 [label="Multi-Head Attention", fillcolor="#a5d8ff"];
    moe2 [label="MoE FFN", fillcolor="#ffc078"];
    tn2 -> mha2 [style=invis];
    mha2 -> moe2 [style=invis];

    tn3 [label="Transformer Block N+2", fillcolor="#dee2e6"];
    mha3 [label="Multi-Head Attention", fillcolor="#a5d8ff"];
    ffn3 [label="Dense FFN", fillcolor="#b2f2bb"];
    tn3 -> mha3 [style=invis];
    mha3 -> ffn3 [style=invis];

    tn1 -> tn2 [ltail=cluster_glam, lhead=cluster_glam, style=invis];
    tn2 -> tn3 [ltail=cluster_glam, lhead=cluster_glam, style=invis];
  }
}
```

*Diagram illustrating the alternating placement of dense FFN and MoE FFN layers in the GLaM architecture.*

GLaM uses a Top-2 gating mechanism, meaning each token is processed by the two experts with the highest router scores.
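To make the routing arithmetic concrete, here is a minimal NumPy sketch of Top-2 gating for a single token. It is an illustration only, not code from the GLaM or Sparsely-Gated MoE papers: the names (`top2_moe_forward`, `gate_w`) and the toy linear experts are hypothetical, and production implementations operate on batches, add router noise, and enforce expert capacity limits.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top2_moe_forward(token, gate_w, experts):
    """Route one token to its top-2 experts and combine their outputs.

    token:   (d_model,) input vector
    gate_w:  (d_model, num_experts) router weight matrix
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = token @ gate_w                      # (num_experts,) router scores
    probs = softmax(logits)
    top2 = np.argsort(probs)[-2:][::-1]          # indices of the two best experts
    weights = probs[top2] / probs[top2].sum()    # renormalize the two gate values
    # Only the selected experts run; their weighted sum is the layer output.
    return sum(w * experts[i](token) for w, i in zip(weights, top2))

# Toy usage: 4 experts, each a random linear map.
rng = np.random.default_rng(0)
d_model, num_experts = 8, 4
gate_w = rng.normal(size=(d_model, num_experts))
experts = [lambda x, W=rng.normal(size=(d_model, d_model)): x @ W
           for _ in range(num_experts)]
y = top2_moe_forward(rng.normal(size=d_model), gate_w, experts)
print(y.shape)  # (8,)
```

Only the two selected experts are evaluated; the remaining experts contribute no computation for this token, which is where the FLOPs savings come from.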
The results were significant: GLaM substantially outperformed the 175-billion-parameter GPT-3 on a range of language tasks while requiring roughly one-third of the energy to train. This highlighted that scaling model size via sparsity can yield better performance per unit of computation.

### Switch Transformers: Simplifying for Speed

The Switch Transformer architecture proposed a radical simplification of MoE routing to maximize training and inference efficiency. Instead of routing a token to the top $k$ experts, a Switch Transformer routes each token to only one expert ($k=1$). This is also known as Switch or Top-1 routing.

The primary motivation for this design is to reduce communication overhead in distributed training setups. In a standard Top-k MoE, each token must be dispatched to multiple experts, which may reside on different hardware accelerators. This all-to-all communication pattern can create bottlenecks. Restricting routing to a single expert makes the communication pattern much simpler and faster.

This simplification yields a notable speedup: the Switch Transformer authors reported pre-training speeds up to 7x faster than a dense model with an equivalent per-token computational budget on the same TPU hardware. The trade-off is a potential reduction in representational capacity, since the model loses the ability to combine outputs from multiple experts for a single token. For many large-scale applications, however, the efficiency gains outweigh this limitation.

```dot
digraph G {
  rankdir=TB;
  graph [fontname="Arial", bgcolor="transparent"];
  splines=ortho;
  node [shape=box, style="filled,rounded", fontname="Arial", fontsize=10];
  edge [fontname="Arial", fontsize=9];

  subgraph cluster_top2 {
    label="Top-2 Routing";
    style=filled;
    color="#e9ecef";
    node [fillcolor="#a5d8ff"];

    token1 [label="Token"];
    gate1 [label="Gating\nNetwork", shape=diamond, fillcolor="#ffd8a8"];

    subgraph cluster_experts_top2 {
      label="Experts";
      style=dotted;
      node [fillcolor="#b2f2bb"];
      e1 [label="Expert 1"];
      e2 [label="Expert 2"];
      e3 [label="Expert 3"];
      e4 [label="Expert 4"];
    }

    combine1 [label="Weighted Sum", shape=circle, fillcolor="#fcc2d7"];
    token1 -> gate1;
    gate1 -> e2 [label=" g=0.6"];
    gate1 -> e4 [label=" g=0.4"];
    e2 -> combine1;
    e4 -> combine1;
  }

  subgraph cluster_switch {
    label="Switch Routing (Top-1)";
    style=filled;
    color="#e9ecef";
    node [fillcolor="#a5d8ff"];

    token2 [label="Token"];
    gate2 [label="Gating\nNetwork", shape=diamond, fillcolor="#ffd8a8"];

    subgraph cluster_experts_switch {
      label="Experts";
      style=dotted;
      node [fillcolor="#b2f2bb"];
      e5 [label="Expert 1"];
      e6 [label="Expert 2"];
      e7 [label="Expert 3"];
      e8 [label="Expert 4"];
    }

    output_switch [label="Output", shape=circle, fillcolor="#fcc2d7"];
    token2 -> gate2;
    gate2 -> e7 [label=" g=1.0"];
    e7 -> output_switch;
  }
}
```

*Comparison of token flow in Top-2 routing versus Switch (Top-1) routing. Switch routing simplifies computation by sending each token to a single expert.*

The Switch Transformer also highlighted the importance of training stability, particularly with lower-precision formats such as BFloat16. The authors introduced techniques such as selective precision (computing the router in float32) and a smaller weight-initialization scale to keep training stable.
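To contrast this with Top-2 gating, the sketch below routes a batch of tokens with Top-1 (Switch-style) dispatch, computes the router in float32 in the spirit of selective precision, and adds a load-balancing term of the form described in the Switch Transformer paper (fraction of tokens per expert times mean router probability, scaled by the number of experts). The names (`switch_route`, `capacity_factor`) and toy dimensions are assumptions for illustration; a full implementation would also handle capacity overflow and cross-device dispatch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def switch_route(tokens, gate_w, num_experts, capacity_factor=1.25):
    """Top-1 (Switch) routing for a batch of tokens.

    tokens: (num_tokens, d_model) batch of token vectors
    gate_w: (d_model, num_experts) router weight matrix
    Returns the chosen expert per token, the gate value used to scale each
    expert output, the per-expert capacity, and an auxiliary balancing loss.
    """
    # Selective precision: run the router math in float32 even if the
    # rest of the model uses a lower-precision format such as bfloat16.
    logits = (tokens @ gate_w).astype(np.float32)
    probs = softmax(logits)                        # (num_tokens, num_experts)
    expert_idx = probs.argmax(axis=-1)             # exactly one expert per token
    gate_value = probs[np.arange(len(tokens)), expert_idx]

    # Tokens beyond an expert's capacity would be dropped (or passed through
    # the residual connection) in a full implementation.
    capacity = int(capacity_factor * len(tokens) / num_experts)

    # Load-balancing loss: fraction of tokens per expert * mean router
    # probability per expert, summed and scaled by the number of experts.
    frac_tokens = np.bincount(expert_idx, minlength=num_experts) / len(tokens)
    mean_prob = probs.mean(axis=0)
    aux_loss = num_experts * np.sum(frac_tokens * mean_prob)
    return expert_idx, gate_value, capacity, aux_loss

# Toy usage: 16 tokens, 4 experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8)).astype(np.float32)
gate_w = rng.normal(size=(8, 4)).astype(np.float32)
idx, gates, cap, aux = switch_route(tokens, gate_w, num_experts=4)
print(idx.shape, cap, round(float(aux), 3))
```

Because each token is sent to exactly one expert, the dispatch volume is roughly half that of Top-2 routing, which is where the communication savings in distributed training come from.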
### Summary of Architectural Properties

These architectural variants represent points in a broad design space. The choice between them depends on the project's goals, such as maximizing model quality, minimizing training time, or simplifying deployment.

| Architecture | Routing Strategy | Innovation | Primary Benefit |
| --- | --- | --- | --- |
| Sparsely-Gated MoE | Noisy Top-k | Introduced sparse gating and the auxiliary load-balancing loss. | Established the foundation for large sparse models. |
| GLaM | Top-2 | Scaled MoEs to the trillion-parameter range with alternating layer placement. | High model quality with lower inference cost. |
| Switch Transformer | Top-1 (Switch) | Simplified routing to a single expert per token. | Reduced communication overhead and faster training. |

As you design your own MoE-based models, you can draw from these patterns. You might choose the simplicity and speed of Switch routing for a production system, or opt for the higher capacity of GLaM-style Top-2 routing for a research model where quality is the primary objective; as sketched after this section, these choices reduce to a handful of hyperparameters. The hands-on exercise at the end of this chapter will give you a chance to implement these ideas directly.
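As a starting point for that exercise, here is a hypothetical configuration sketch showing how the variants in the table above differ along a few axes. The class, field names, and default values are illustrative choices, not the published settings of any of these models.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Design knobs that distinguish the variants summarized above."""
    num_experts: int = 64          # experts per MoE layer
    top_k: int = 2                 # 2 ~ GLaM-style routing, 1 ~ Switch routing
    moe_every_n_blocks: int = 2    # 2 ~ GLaM's alternating placement
    aux_loss_weight: float = 1e-2  # strength of the load-balancing loss

def block_uses_moe(block_index: int, cfg: MoEConfig) -> bool:
    """Decide whether a given Transformer block gets an MoE FFN or a dense FFN."""
    return block_index % cfg.moe_every_n_blocks == cfg.moe_every_n_blocks - 1

# A Switch-style variant simply sets top_k=1; layer placement remains a free choice.
switch_style = MoEConfig(top_k=1)
print([block_uses_moe(i, switch_style) for i in range(6)])
# [False, True, False, True, False, True]
```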