While expert parallelism is a powerful tool for scaling the number of parameters in a Mixture of Experts model, it rarely operates in isolation. To train state-of-the-art sparse models, which can have trillions of parameters yet must still be trained on massive datasets, you need to orchestrate multiple parallelism strategies at once. Each strategy addresses a different scaling bottleneck, and their combination allows for training models at a scale that would otherwise be impossible.

The three primary dimensions of parallelism are:

- Data Parallelism (DP): The simplest form. The model is replicated on each device, and the global data batch is split among them. It improves throughput but requires that the entire model fit on a single device.
- Tensor Parallelism (TP): A form of model parallelism where individual layers, like large weight matrices, are sharded across devices. This allows the dense components of a model to exceed the memory of a single GPU.
- Expert Parallelism (EP): The MoE-specific strategy where different expert networks are placed on different devices. This scales the number of experts, and thus the total parameter count.

Successfully training a large-scale MoE involves weaving these strategies together into a cohesive and efficient training configuration.

2D Combination: Data and Expert Parallelism (DP+EP)

The most common hybrid strategy for MoE models combines data and expert parallelism. This approach is effective for models where the dense components (like attention blocks and embeddings) can fit on a single GPU, but the full set of experts is too large to fit alongside them.

In this setup, you typically have a group of devices, for example the 8 GPUs within a single server node. The parallelism works as follows:

- Data Parallelism: The non-expert layers of the model are replicated on all 8 GPUs. The global batch is split evenly, with each GPU processing its own slice of data through these replicated layers.
- Expert Parallelism: The experts within each MoE layer are sharded across the 8 GPUs. If an MoE layer has 64 experts, each GPU holds $64 / 8 = 8$ experts.

The forward pass involves a critical communication step. After a token passes through a local dense layer, the gating network determines which expert it should be routed to. If that expert resides on another GPU, the token's hidden state must be sent across the device interconnect. This is accomplished using an all-to-all communication collective, where every GPU sends a subset of its tokens to every other GPU and receives tokens in return. After the remote expert processes the token, another all-to-all operation sends it back to its original device to continue the forward pass.
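To make this dispatch-and-combine pattern concrete, here is a minimal sketch using jax.pmap and lax.all_to_all. It is an illustration under simplifying assumptions, not a full MoE layer: the gating, capacity, and load-balancing logic is omitted, each device holds a single toy expert, and the buffer sizes (`capacity`, `hidden`) are arbitrary names chosen for this example.

```python
# Minimal sketch of the DP+EP dispatch -> expert -> combine pattern.
# Assumes tokens have already been grouped into per-destination buffers by the
# (omitted) gating logic, and gives each device a single toy linear "expert".
import functools
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()     # e.g. 8 GPUs in one node
capacity, hidden = 16, 64            # tokens per destination buffer, model width

@functools.partial(jax.pmap, axis_name="expert")
def moe_layer(tokens, w_expert):
    # tokens: (n_dev, capacity, hidden) on each device; slot i holds the tokens
    # this device routes to device i. The first all-to-all delivers them.
    routed = jax.lax.all_to_all(tokens, "expert", split_axis=0, concat_axis=0)
    # Apply the local expert (a single toy linear layer here).
    out = jnp.einsum("dch,hk->dck", routed, w_expert)
    # The second all-to-all returns each token to the device that owns its sequence.
    return jax.lax.all_to_all(out, "expert", split_axis=0, concat_axis=0)

tokens = jnp.zeros((n_dev, n_dev, capacity, hidden))   # leading axis = devices
w = jnp.zeros((n_dev, hidden, hidden))                 # one toy expert per device
y = moe_layer(tokens, w)                               # (n_dev, n_dev, capacity, hidden)
```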
[Figure: a single node of 4 GPUs forming one data-parallel group. Each GPU holds a replica of the dense layers plus 16 of the 64 experts (GPU 0: experts 1-16, GPU 1: 17-32, GPU 2: 33-48, GPU 3: 49-64). Each GPU receives its own batch slice, and dashed edges between the GPUs mark the all-to-all token shuffle.]

A 2D parallelism setup with 4 GPUs. The model's dense layers are replicated for data parallelism, while the 64 experts are sharded across the GPUs for expert parallelism. The dashed lines represent the all-to-all communication required to route tokens to their assigned experts.

This DP+EP combination effectively scales the model's parameter count via more experts while also scaling the training throughput via larger global batch sizes. The primary bottleneck becomes the all-to-all communication, which can saturate the interconnect bandwidth between devices.
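A quick back-of-the-envelope calculation shows why. The numbers below (micro-batch, sequence length, hidden width, top-2 routing, bf16 activations) are illustrative assumptions, and the estimate ignores the fraction of routed tokens that happen to land on a local expert.

```python
# Rough estimate of all-to-all traffic per GPU per MoE layer under DP+EP.
# All numbers are illustrative assumptions, not measurements.
tokens_per_gpu = 8 * 2048          # micro-batch of 8 sequences x 2048 tokens
hidden = 4096                      # model width
top_k = 2                          # experts each token is routed to
bytes_per_elem = 2                 # bf16 activations

# Each routed token copy is sent once (dispatch) and returned once (combine).
per_layer_bytes = tokens_per_gpu * top_k * hidden * bytes_per_elem * 2
print(f"{per_layer_bytes / 1e9:.2f} GB of all-to-all traffic per GPU per MoE layer")
# ~0.54 GB here; repeated across dozens of MoE layers per step, this is why the
# interconnect (NVLink within a node, InfiniBand/Ethernet across nodes) is often
# the limiting factor rather than raw compute.
```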
3D Combination: Data, Expert, and Tensor Parallelism (DP+EP+TP)

For the largest models, even the dense components become too large for a single GPU. This is where a third dimension, tensor parallelism, becomes necessary. Combining all three strategies enables training models of extreme scale across large, multi-node GPU clusters.

The hierarchy of this arrangement can be visualized as a 3D grid of devices:

- Tensor Parallelism Group: A small group of GPUs (e.g., 2, 4, or 8) works together to form a single "virtual device." They collectively store and compute one instance of the model's large dense layers. Communication within this group is frequent, often requiring all-reduce or all-gather operations for every tensor-parallel layer.
- Expert Parallelism Group: Experts are sharded across these tensor-parallel groups. For example, in a 16-GPU setup organized into four TP groups of 4 GPUs each, you might assign experts 1-16 to the first TP group, experts 17-32 to the second, and so on. Routing a token now means sending it from its source TP group to the target TP group.
- Data Parallelism Group: The entire multi-node configuration that hosts the tensor- and expert-parallel model can be replicated to process different data batches. If you have 32 GPUs, you might have two data-parallel replicas, each a 16-GPU setup as described above. This requires a final all-reduce across the data-parallel replicas to average gradients.

This 3D approach creates distinct communication patterns at each level of the hierarchy: intra-group communication for tensor parallelism, inter-group all-to-all for expert parallelism, and a global all-reduce for data parallelism.

[Figure: 8 GPUs organized as two data-parallel replicas. Within each replica, the GPUs are paired into 2-GPU tensor-parallel groups linked by TP communication; experts 1-32 live on the first TP group and experts 33-64 on the second, connected by expert-parallel all-to-all traffic; the two replicas are linked by the data-parallel all-reduce.]

A 3D parallelism strategy over 8 GPUs. The GPUs are first organized into 2-GPU tensor parallel (TP) groups. Expert parallelism shards the 64 experts across these TP groups. Finally, this entire 4-GPU setup is replicated for data parallelism. Each type of parallelism involves distinct communication patterns.
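To make the grouping in the figure above concrete, the short sketch below maps each global GPU rank in the 8-GPU example onto its (data-parallel, expert/TP-group, tensor-parallel) coordinates. The specific rank ordering is an assumption chosen for illustration; real frameworks let you configure how these axes are laid out over the physical topology.

```python
# Map global ranks onto a (dp, ep, tp) grid matching the 8-GPU figure:
# 2 data-parallel replicas x 2 TP groups per replica x 2 GPUs per TP group.
# Axis ordering (tp fastest, then ep, then dp) is an illustrative assumption.
dp_size, ep_size, tp_size = 2, 2, 2
world_size = dp_size * ep_size * tp_size

for rank in range(world_size):
    tp = rank % tp_size                       # position inside a TP group
    ep = (rank // tp_size) % ep_size          # which TP group (expert shard) within the replica
    dp = rank // (tp_size * ep_size)          # which data-parallel replica
    experts = "1-32" if ep == 0 else "33-64"
    print(f"GPU {rank}: replica {dp}, TP group {ep} (experts {experts}), tp rank {tp}")
```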
Frameworks and Practical Implementation

Manually implementing these combined strategies is an immense engineering task. Fortunately, distributed training frameworks like DeepSpeed, Megatron-LM, and JAX handle most of this complexity. These libraries provide high-level APIs to define the parallelism strategy, often by specifying the size of each dimension in the device grid.

For example, using JAX with pjit, you might define a 3D device mesh:

mesh = Mesh(devices, ('data', 'expert', 'model'))

You would then use sharding annotations to tell the compiler how to partition the model's weights and intermediate activations along this mesh. The framework's compiler is responsible for translating these annotations into the correct low-level communication collectives (all-reduce, all-to-all, all-gather). A minimal sketch of this style of annotation appears at the end of the section.

Choosing the right combination and configuration depends on your specific model architecture and hardware environment. A model with a large number of experts but relatively small dense layers will benefit most from a larger expert-parallel dimension. Conversely, a model with enormous dense layers might prioritize a larger tensor-parallel dimension. Balancing these dimensions to maximize hardware utilization and minimize communication bottlenecks is a significant part of optimizing large-scale MoE training.
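As a concrete illustration of the mesh-and-annotation approach described above, here is a minimal JAX sketch that builds a 2x2x2 device mesh and shards a stack of expert weights over the 'expert' and 'model' axes. The mesh shape, tensor shapes, and axis assignments are illustrative assumptions rather than a recipe from any particular framework; it uses the jax.sharding API, which expresses the same idea as the pjit wrapper mentioned earlier, and it assumes 8 visible devices.

```python
# Minimal sketch: a 3-D device mesh and sharding annotations for MoE weights.
# Assumes 8 visible devices (for a CPU-only test you can set
# XLA_FLAGS=--xla_force_host_platform_device_count=8). Shapes are illustrative.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(2, 2, 2)      # (data, expert, model)
mesh = Mesh(devices, ("data", "expert", "model"))

num_experts, d_model, d_ff = 64, 1024, 4096

# Stacked expert FFN weights: split the expert axis over 'expert' (expert
# parallelism) and the feed-forward axis over 'model' (tensor parallelism);
# the weights are replicated across the 'data' axis.
expert_w = jax.device_put(
    jnp.zeros((num_experts, d_model, d_ff)),
    NamedSharding(mesh, P("expert", None, "model")),
)

# Activations: split the batch axis over 'data', replicate over the rest.
x = jax.device_put(
    jnp.zeros((32, 128, d_model)),                      # (batch, seq, d_model)
    NamedSharding(mesh, P("data", None, None)),
)

# Any jit-compiled computation over these arrays now runs under the declared
# shardings, and the XLA/GSPMD partitioner inserts the all-gather, all-reduce,
# and all-to-all collectives implied by the annotations.
```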