As introduced earlier in this chapter, scaling Mixture of Experts models confronts us with substantial memory demands, primarily driven by the potentially large number of individual expert networks. Standard Data Parallelism (DP), where each worker holds a complete replica of the model, quickly becomes impractical as the number of experts grows into the dozens or hundreds. Each expert might itself be a multi-layer perceptron (MLP) with millions of parameters. Storing all experts on every device leads to prohibitive memory consumption.
Expert Parallelism (EP) directly addresses this challenge by partitioning the experts within an MoE layer across the available computational devices (e.g., GPUs). Instead of each device holding all N experts, each device holds only a fraction, typically N/D, where D is the number of devices participating in the expert parallel group.
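To make this concrete, the sketch below shows how a rank in an expert-parallel group might instantiate only its N/D local experts. The class name, MLP shape, and helper structure are illustrative assumptions rather than part of any particular framework, and an initialized process group is assumed.

```python
import torch.nn as nn
import torch.distributed as dist

class LocalExpertShard(nn.Module):
    """Holds only the experts assigned to this rank (illustrative sketch)."""

    def __init__(self, num_experts_total: int, d_model: int, d_ff: int):
        super().__init__()
        rank = dist.get_rank()          # position of this device in the EP group
        world = dist.get_world_size()   # D: number of devices sharing the experts
        assert num_experts_total % world == 0, "this sketch assumes N divisible by D"
        per_rank = num_experts_total // world

        # Global IDs of the experts that live on this device (e.g. rank 2 of 4
        # with 8 experts owns experts 4 and 5).
        self.local_expert_ids = list(range(rank * per_rank, (rank + 1) * per_rank))

        # Only N/D expert MLPs are ever instantiated here, so per-device
        # parameter memory stays flat as the total expert count grows.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in self.local_expert_ids
        )
```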
Consider an MoE layer within a transformer block. Under Expert Parallelism, the sequence of operations for processing a batch of tokens involves these steps:
1. Gating: the router on each device computes gating scores for its local tokens and assigns each token to an expert, which may reside on a different device.
2. Dispatch: an All-to-All communication primitive is used to exchange tokens among devices. Device i sends the tokens destined for experts on device k directly to device k.
3. Expert computation: each device runs its locally held experts on the tokens it received.
4. Combine: a second All-to-All communication step gathers these results, ensuring each device receives the outputs corresponding to the tokens it initially processed in step 1.

The diagram below illustrates this flow across four devices, each holding two distinct experts; a code sketch of the dispatch/combine pattern follows the figure.
Distribution of 8 experts across 4 devices (2 experts per device). Dashed lines represent the first All-to-All communication (sending tokens T based on gating assignments to target experts Ei). Dotted lines represent the second All-to-All (returning processed tokens P(T) to their originating device). Only a subset of routes is shown for clarity.
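The dispatch and combine steps can be expressed with PyTorch's collective communication primitives. The following is a simplified sketch that assumes tokens have already been grouped by destination rank and that the expert-parallel process group is initialized; for brevity, each rank applies a single callable to everything it receives, whereas a real implementation would also split the received tokens among its local experts. Production systems fuse and heavily optimize these steps.

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(send_tokens, run_local_experts, group=None):
    """Sketch of the two All-to-All steps in an expert-parallel MoE layer.

    send_tokens: list of length D; send_tokens[k] holds the tokens (shape
        [n_k, d_model]) that the local gate assigned to experts on rank k.
    run_local_experts: callable applying this rank's experts to a token batch.
    """
    device = send_tokens[0].device
    d_model = send_tokens[0].shape[-1]

    # Exchange per-rank token counts so every rank can size its receive buffers.
    send_counts = torch.tensor([t.shape[0] for t in send_tokens], device=device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)

    # First All-to-All: dispatch tokens to the ranks that hold their experts.
    recv_tokens = [
        torch.empty(int(n), d_model, dtype=send_tokens[0].dtype, device=device)
        for n in recv_counts
    ]
    dist.all_to_all(recv_tokens, send_tokens, group=group)

    # Local expert computation on everything received from other ranks.
    processed = [run_local_experts(chunk) for chunk in recv_tokens]

    # Second All-to-All: return processed tokens to their originating ranks.
    out_tokens = [torch.empty_like(t) for t in send_tokens]
    dist.all_to_all(out_tokens, processed, group=group)
    return out_tokens
```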
The primary advantage of Expert Parallelism is memory efficiency. By partitioning the experts, you can instantiate MoE models with a vastly larger total number of parameters than would fit onto a single device. This allows for scaling model capacity (through more experts) without proportionally increasing the memory burden on individual workers. It also distributes the computational load of the expert forward and backward passes.
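A back-of-the-envelope calculation illustrates the saving; the model dimensions below are hypothetical and chosen only to keep the arithmetic simple.

```python
# Illustrative numbers (hypothetical model, fp16 weights): 64 experts,
# each a d_model=4096 -> d_ff=16384 -> d_model MLP (biases ignored).
num_experts, d_model, d_ff, bytes_per_param = 64, 4096, 16384, 2
params_per_expert = 2 * d_model * d_ff                  # two linear layers
all_experts_bytes = num_experts * params_per_expert * bytes_per_param
per_device_bytes = all_experts_bytes / 8                # experts sharded across D=8 devices

print(f"all experts on one device: {all_experts_bytes / 2**30:.1f} GiB")   # 16.0 GiB
print(f"with expert parallelism (D=8): {per_device_bytes / 2**30:.1f} GiB") # 2.0 GiB
```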
However, this benefit comes at the cost of increased communication. The two All-to-All operations are communication-intensive, especially at large scales. Their latency and bandwidth requirements can become significant bottlenecks, potentially limiting overall training throughput if not carefully managed. Optimizing this communication is a major focus when scaling MoE models, as discussed later in this chapter.
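A rough estimate of the traffic involved shows why. Assuming roughly uniform routing, each device sends about a (D-1)/D fraction of its token activations in each All-to-All, so with the hypothetical sizes below and top-1 routing:

```python
# Rough per-device traffic for one MoE layer (hypothetical sizes,
# top-1 routing, fp16 activations, roughly uniform routing assumed).
tokens_per_device, d_model, bytes_per_elem, D = 8192, 4096, 2, 8
one_way = tokens_per_device * d_model * bytes_per_elem * (D - 1) / D
total_gib = 2 * one_way / 2**30        # dispatch + combine
print(f"~{total_gib:.2f} GiB exchanged per device per MoE layer (forward pass only)")
```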
Implementing the All-to-All communication and managing the distributed state require specialized libraries. Frameworks like DeepSpeed (through its MoE implementation) and Tutel abstract much of this complexity, providing optimized communication kernels and integration with standard deep learning frameworks such as PyTorch. These libraries handle the token shuffling and coordinate the expert computation; a minimal usage sketch appears at the end of this section.

In summary, Expert Parallelism is a foundational technique for scaling Mixture of Experts models. It partitions experts across devices, enabling massive model sizes by reducing per-device memory requirements, but it introduces significant All-to-All communication overhead that must be carefully considered and optimized.
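As a closing illustration of the framework support mentioned above, here is a minimal sketch of declaring an expert-parallel MoE layer with DeepSpeed. It assumes the deepspeed.moe.layer.MoE interface; argument names and return values can differ between DeepSpeed versions, so treat this as an outline rather than a definitive recipe.

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE  # assumed import path; check your DeepSpeed version

d_model, d_ff = 1024, 4096

# The expert definition: a plain feed-forward block. DeepSpeed replicates it
# num_experts times and shards those copies across the expert-parallel group.
expert = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.GELU(),
    nn.Linear(d_ff, d_model),
)

# ep_size is the size of the expert-parallel group: with 16 experts and
# ep_size=8, each rank stores and runs only 2 of them.
moe_layer = MoE(
    hidden_size=d_model,
    expert=expert,
    num_experts=16,
    ep_size=8,
    k=1,  # top-1 gating
)

# Inside the transformer block's forward pass (sketch; DeepSpeed's MoE layer
# typically returns the output together with an auxiliary load-balancing loss):
# output, aux_loss, expert_counts = moe_layer(hidden_states)
```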