A standard dense neural network operates on a principle of full activation. For every input, every single parameter in the model participates in the computation. This approach has been remarkably successful, but it runs into a significant bottleneck as models scale into the hundreds of billions of parameters. The computational cost, measured in floating-point operations (FLOPs), grows in direct proportion to the parameter count, making training and inference prohibitively expensive.
Sparsely-Gated Mixture of Experts (MoE) architectures offer a more efficient scaling path by embracing conditional computation. The core idea is simple yet powerful: instead of activating the entire network for every input, we activate only a small, relevant portion of it. An MoE model achieves this by maintaining a large collection of "expert" sub-networks and using a lightweight "gating network" to dynamically select which experts should process each input token. This allows for a massive increase in model capacity (total parameters) while keeping the computational cost for a single forward pass relatively constant.
The architecture of a Sparsely-Gated MoE layer is composed of two primary components.
Experts are the workhorses of the MoE layer. Each expert is itself a neural network, typically a simple feed-forward network (FFN), that learns to specialize in processing certain types of information. For example, in a large language model, one expert might become adept at handling syntax in Python code, while another might specialize in French-language idioms.
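To make this concrete, a single expert can be sketched as a small two-layer FFN. The snippet below assumes PyTorch; the class name and the `d_model`/`d_hidden` dimensions are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """One expert: a plain two-layer feed-forward network (FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),   # expand
            nn.GELU(),                      # nonlinearity
            nn.Linear(d_hidden, d_model),   # project back to the model dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A token representation keeps its shape after passing through an expert.
print(Expert(512, 2048)(torch.randn(1, 512)).shape)  # torch.Size([1, 512])
```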
Because an input token is only routed to a few experts (often just one or two), the model can contain a large number of them. A model might have 8, 64, or even more experts, but only a fraction are used for any given token. This is the source of the architecture's efficiency. The total parameter count is the sum of all expert parameters (plus the small gating network), but the active parameter count for a forward pass remains small.
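A quick back-of-the-envelope count makes the distinction concrete. The configuration below is hypothetical and ignores biases and all non-FFN parameters; it only illustrates how total and active parameter counts diverge.

```python
# Hypothetical MoE configuration (for illustration only).
d_model, d_hidden = 4096, 16384       # FFN input and hidden sizes
num_experts, k = 64, 2                # 64 experts, top-2 routing

params_per_expert = 2 * d_model * d_hidden           # two weight matrices per expert
total_expert_params = num_experts * params_per_expert
active_expert_params = k * params_per_expert
router_params = d_model * num_experts                # gating network weight matrix

print(f"total FFN parameters: {total_expert_params + router_params:,}")   # ~8.6 billion
print(f"active per token:     {active_expert_params + router_params:,}")  # ~0.27 billion
```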
The gating network, also called the router, acts as the traffic controller. Its job is to examine an incoming token and decide which of the available experts are best suited to process it. The gating network is usually a small linear layer followed by a softmax function. It takes the token's input representation and produces a probability distribution over all experts.
$$G(x) = \mathrm{softmax}(x \cdot W_g)$$

Here, $W_g$ is the learnable weight matrix of the gating network and $x$ is the token's representation. The resulting gate vector $G(x)$ contains the scores that determine which experts are chosen. To enforce sparsity, a top-$k$ function is applied to this vector. For example, if $k = 2$, only the two experts with the highest scores are selected. The scores of all other experts are set to zero, effectively deactivating them for the current token.
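The scoring and top-$k$ selection can be sketched in a few lines. This assumes PyTorch, and the sizes are arbitrary; following the description above, the softmax is taken over all experts before the non-selected scores are zeroed out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts, k = 512, 8, 2          # illustrative sizes

# The gating network: a single linear layer producing one score per expert.
gate = nn.Linear(d_model, num_experts, bias=False)

x = torch.randn(1, d_model)                  # one token's representation
scores = F.softmax(gate(x), dim=-1)          # probability distribution over experts

# Keep only the k highest-scoring experts; zero out the rest.
topk_vals, topk_idx = torch.topk(scores, k, dim=-1)
sparse_gates = torch.zeros_like(scores).scatter_(-1, topk_idx, topk_vals)

print(topk_idx)        # indices of the two selected experts
print(sparse_gates)    # scores for the selected experts, zeros elsewhere
```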
The complete forward pass for a single token is a coordinated process between the gating network and the experts. First, the token is sent to the gating network, which scores the experts and selects the top-k to activate. The token is then passed to those selected experts for processing. Finally, the outputs from the active experts are combined in a weighted sum, using the gating scores as the weights.
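The sketch below puts these steps together in a minimal, deliberately unoptimized MoE layer, again assuming PyTorch. It loops over experts for clarity; production implementations batch the dispatch and add machinery (such as load balancing) that is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k MoE layer operating on a flat batch of tokens."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (tokens, num_experts)
        topk_vals, topk_idx = torch.topk(scores, self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # each of the k routing slots
            idx = topk_idx[:, slot]                         # chosen expert per token
            weight = topk_vals[:, slot].unsqueeze(-1)       # its gating score
            for e, expert in enumerate(self.experts):
                mask = idx == e                             # tokens routed to expert e
                if mask.any():
                    out[mask] += weight[mask] * expert(x[mask])
        return out

layer = MoELayer(d_model=512, d_hidden=2048, num_experts=8, k=2)
print(layer(torch.randn(4, 512)).shape)                     # torch.Size([4, 512])
```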
Figure: The flow of a token through an MoE layer. The input token is sent to the gating network, which outputs scores for all experts. Based on these scores (e.g., top-k), the token is routed to a small subset of active experts (here, Expert 2). The outputs from the active experts are aggregated to produce the final output. Inactive experts do not perform any computation.
This sparse activation is the defining feature of the architecture. In a Transformer model, the MoE layer is typically used to replace the dense feed-forward network (FFN) block that appears after the self-attention mechanism. By doing so, we can dramatically scale the number of parameters in the FFN part of the model without a corresponding surge in computational demand, creating a path for more capable and efficient large-scale models. In the following sections, we will examine the mathematical details of this process, the challenges it introduces, and the techniques used to overcome them.
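As a closing sketch of that placement, the block below reuses the MoELayer from the previous example inside a pre-norm Transformer block, swapping it in where the dense FFN would otherwise sit. The attention setup and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Transformer block with the dense FFN replaced by an MoE layer.
    Reuses the MoELayer sketch defined above; pre-norm layout assumed."""
    def __init__(self, d_model: int, n_heads: int, num_experts: int, k: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # A dense block would place a single FFN here; the MoE variant holds
        # many experts but activates only k of them per token.
        self.moe = MoELayer(d_model, d_hidden=4 * d_model,
                            num_experts=num_experts, k=k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        b, s, d = x.shape
        # The MoE sketch expects a flat list of tokens, so reshape around it.
        x = x + self.moe(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)
        return x
```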