In traditional dense feed-forward networks or transformer blocks, every input token interacts with all the parameters within that layer. When an input tensor x passes through a dense layer, the computation involves multiplying x by the layer's full weight matrix W. Subsequent layers repeat this process. This approach ensures that all parts of the network contribute to processing each piece of information.
Consider a standard feed-forward network (FFN) block within a transformer:
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2$$
Here, $W_1$, $b_1$, $W_2$, and $b_2$ represent the weights and biases. For every input token $x$, the entire set of these parameters is engaged in computation. The computational cost, often measured in Floating Point Operations (FLOPs), scales directly with the size of these weight matrices. If we increase the hidden dimension or intermediate size of the FFN, both the parameter count and the FLOPs per token increase significantly.
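For reference, here is a minimal PyTorch sketch of this dense block; the dimensions (d_model=512, d_ff=2048) are illustrative assumptions, and the point is simply that every token multiplies against both full weight matrices.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Standard transformer FFN: every token touches every parameter."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.fc2 = nn.Linear(d_ff, d_model)   # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); both weight matrices are used for every token
        return self.fc2(torch.relu(self.fc1(x)))

ffn = DenseFFN(d_model=512, d_ff=2048)   # illustrative sizes
x = torch.randn(2, 16, 512)              # (batch=2, seq_len=16, d_model=512)
y = ffn(x)                               # shape: (2, 16, 512)
```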
Sparse activation, as employed in Mixture of Experts models, fundamentally alters this dynamic. Instead of activating the entire layer, only a selected subset of the parameters, the 'experts', is used for any given input token. Recall the basic MoE formulation introduced earlier:
$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$
Typically, the gating network $G(x)$ selects only a small number, $k$, of experts (e.g., $k=2$) out of a much larger pool of $N$ experts (e.g., $N=64$). This means that for a single token, the computational cost is primarily determined by the size of the $k$ chosen experts and the gating network, not the total number of experts $N$.
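To make the routing concrete, here is a deliberately simplified top-k MoE layer in PyTorch. It omits everything a production router needs (load balancing, capacity limits, efficient dispatch) and loops over experts naively; the expert sizes, N=8, and k=2 are illustrative assumptions. The point is only that each token's output comes from its $k$ selected experts, weighted by the renormalized gate scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopKMoE(nn.Module):
    """Minimal top-k MoE layer: each token runs through only k of the N experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # G(x): one logit per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- batch/sequence dims flattened for simplicity
        logits = self.gate(x)                                   # (num_tokens, N)
        topk_vals, topk_idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)                  # renormalize over the k chosen experts
        y = torch.zeros_like(x)
        for slot in range(self.k):                              # for each of the k routing slots
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                   # tokens whose slot-th choice is expert e
                if mask.any():
                    y[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

moe = SimpleTopKMoE(d_model=512, d_ff=2048, num_experts=8, k=2)
tokens = torch.randn(32, 512)   # 32 tokens
out = moe(tokens)               # (32, 512)
```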
Let's break down the comparison:
Computational Cost (FLOPs)
- Dense Models: The FLOPs per token are proportional to the size of the dense layer parameters. Scaling up the model capacity by increasing layer width directly increases the computational cost for every single token. For an FFN with input/output dimension $d_{\text{model}}$ and intermediate dimension $d_{\text{ff}}$, the two matrix multiplications cost roughly $2 \times d_{\text{model}} \times d_{\text{ff}}$ multiply-accumulate operations (about $4\,d_{\text{model}} d_{\text{ff}}$ FLOPs) per token.
- Sparse MoE Models: The FLOPs per token are dominated by the computation within the $k$ selected experts. If each expert $E_i$ has approximately the same computational cost as a smaller dense FFN, the total computational cost per token is roughly $k$ times the cost of one expert, plus the small overhead of the gating network $G(x)$. Importantly, increasing the total number of experts $N$ (from, say, 16 to 64) does not increase the FLOPs per token, assuming $k$ remains constant. This decoupling allows MoE models to grow their parameter count and potential representational capacity much more dramatically than dense models for the same computational budget per token (a rough FLOPs comparison is sketched below).
Comparison showing how FLOPs per token scale with total model parameters. Dense models exhibit coupled growth, while sparse MoE models allow parameter scaling with nearly constant FLOPs per token (assuming fixed $k$). The slight increase for MoE accounts for gating overhead; the parameter growth comes from adding experts while $k$ stays fixed. Axes represent relative scale.
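To see the decoupling numerically, the short script below compares approximate per-token FLOPs for a dense FFN and a top-k MoE layer as $N$ grows. It counts a multiply-add as two FLOPs, and all dimensions are illustrative assumptions rather than values from any particular model.

```python
def ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    # Two matmuls (d_model x d_ff and d_ff x d_model), counting a multiply-add as 2 FLOPs.
    return 2 * (d_model * d_ff) * 2

def moe_flops_per_token(d_model: int, d_ff: int, num_experts: int, k: int) -> int:
    expert = ffn_flops_per_token(d_model, d_ff)   # each expert is an expert-sized FFN
    gate = 2 * d_model * num_experts              # the single linear gating layer G(x)
    return k * expert + gate

d_model, d_ff = 512, 2048
print("dense FFN (one expert-sized block):", ffn_flops_per_token(d_model, d_ff))
for n in (16, 64, 256):
    print(f"MoE with N={n:3d}, k=2           :", moe_flops_per_token(d_model, d_ff, n, k=2))
# Per-token FLOPs stay at roughly 2x the single-expert cost regardless of N,
# while total parameters grow roughly linearly with N.
```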
Parameter Count
- Dense Models: Parameter count and computational cost are tightly linked. Increasing one typically increases the other proportionally. A 100B-parameter dense model must engage all 100B parameters for every token in every forward pass.
- Sparse MoE Models: Parameter count is decoupled from per-token FLOPs. An MoE model might have 1 trillion total parameters distributed across $N$ experts, but if $k=2$, the computation for a single token might resemble that of a much smaller dense model (roughly twice the size of a single expert). This is the primary appeal of MoEs for scaling model capacity beyond what's feasible with dense architectures under similar computational constraints.
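A rough parameter count for a single MoE layer makes the gap between total and "active" parameters concrete. The dimensions below (d_model=4096, d_ff=16384, N=64, k=2) are illustrative assumptions, not values taken from a specific model.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # Two weight matrices plus their biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model, d_ff = 4096, 16384          # illustrative sizes for a large model
num_experts, k = 64, 2

expert_params = ffn_params(d_model, d_ff)
gate_params = d_model * num_experts + num_experts

total_params = num_experts * expert_params + gate_params   # stored in the layer
active_params = k * expert_params + gate_params            # touched per token

print(f"total parameters : {total_params / 1e9:.2f} B")    # ~8.59 B
print(f"active per token : {active_params / 1e9:.2f} B")   # ~0.27 B
```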
Memory Usage
- Dense Models: During inference, the entire set of model parameters must typically be loaded into accelerator memory (e.g., GPU VRAM). Memory requirements scale directly with the parameter count.
- Sparse MoE Models: While only $k$ experts are computationally active per token, practical implementations often require all $N$ expert parameters to be accessible within the distributed system, particularly during training. This is because different tokens in a batch, potentially processed across different devices, might require different experts. Therefore, the total parameter memory footprint of an MoE model can be substantially larger than that of a dense model with equivalent per-token FLOPs. This necessitates sophisticated distributed training strategies (like expert parallelism, discussed in Chapter 4) to manage memory effectively. Activation memory during training also needs careful management, although its pattern differs from that of dense models.
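A similarly coarse estimate of weight memory illustrates why this matters: even though only $k$ experts compute per token, the weights of all $N$ experts must be resident somewhere across the cluster. The bytes-per-parameter and parameter counts below are assumptions carried over from the previous sketch.

```python
BYTES_PER_PARAM = 2                  # assuming bf16/fp16 weights; fp32 would double this

def weight_memory_gb(num_params: float) -> float:
    return num_params * BYTES_PER_PARAM / 1e9

total_params = 8.59e9                # MoE layer totals from the previous sketch
active_params = 0.27e9

print(f"weights that must be resident (all N experts): {weight_memory_gb(total_params):.1f} GB")
print(f"weights actually used per token (k experts)  : {weight_memory_gb(active_params):.2f} GB")
# Expert parallelism shards the ~17 GB of expert weights across devices, but that memory
# still has to exist somewhere; a dense layer with the same per-token FLOPs would need
# only ~0.5 GB of weights.
```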
Conceptual Flow
The difference in activation patterns can be visualized simply:
Data flow comparison. In a dense layer, the input activates all parameters. In a sparse MoE layer, the gating network routes the input to a subset of experts (e.g., Expert 1 and Expert 3), activating only their parameters. Inactive experts (e.g., Expert 2, Expert 4) are shown in gray.
Implications and Trade-offs
The sparse activation strategy offers a compelling path towards building models with enormous parameter counts while keeping the computational cost per token manageable. This allows for potentially greater model capacity and specialization. However, this benefit comes with inherent complexities:
- Training Challenges: Ensuring all experts are utilized effectively (load balancing) and learn distinct functions requires specialized techniques, such as auxiliary loss functions (Chapter 3); a simplified sketch of one such loss follows this list.
- Communication Overhead: In distributed settings, routing tokens to the correct experts residing on different devices introduces significant communication costs (All-to-All communication), demanding optimization (Chapter 4).
- Inference Complexity: Efficiently handling dynamic routing and batching during inference requires careful implementation and hardware-specific optimizations (Chapter 5).
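As a small preview of the load-balancing techniques covered in Chapter 3, the sketch below implements an auxiliary loss in the style popularized by the Switch Transformer: for each expert it multiplies the fraction of tokens routed to that expert by the mean router probability it receives, so the loss is minimized when routing is uniform. Treat the exact formulation here as a simplified assumption rather than a definitive recipe.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        expert_indices: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Simplified auxiliary loss that encourages uniform expert utilization.

    router_logits:  (num_tokens, num_experts) raw gate outputs
    expert_indices: (num_tokens,) index of the top-1 expert chosen per token
    """
    probs = F.softmax(router_logits, dim=-1)                       # router probabilities
    # f_i: fraction of tokens dispatched to expert i (not differentiable)
    f = F.one_hot(expert_indices, num_experts).float().mean(dim=0)
    # p_i: mean router probability assigned to expert i (differentiable)
    p = probs.mean(dim=0)
    # Scaled so a perfectly uniform router yields a loss of ~1.0.
    return num_experts * torch.sum(f * p)
```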
Understanding this fundamental contrast between dense and sparse activation is essential. It motivates the architectural choices, training methodologies, and scaling strategies specifically developed for Mixture of Experts models, which we will explore in the subsequent chapters. While dense models offer simplicity in implementation and training dynamics, sparse models provide a route to significantly larger model scale within feasible computational budgets, albeit with increased system complexity.