While Expert Parallelism (EP) directly addresses the distribution of experts within an MoE layer and Data Parallelism (DP) replicates the entire model, Pipeline Parallelism (PP) offers a complementary approach to manage the considerable memory footprint and computational graph depth of large models, including those incorporating MoE layers. Pipeline parallelism partitions the model's layers, not the data or the experts themselves, into sequential stages, assigning each stage to a different set of processing devices.
Imagine a deep transformer model. Instead of placing all layers onto every device (as in pure DP) or splitting only the MoE experts across devices (as in pure EP), PP divides the sequence of layers. For instance, layers 1-8 might form Stage 1 on Device Group A, layers 9-16 form Stage 2 on Device Group B, and so on.
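As a minimal sketch of this layer-to-stage assignment (plain Python; the layer names, counts, and the `partition_layers` helper are illustrative, not a framework API), contiguous blocks of layers can be grouped into stages:

```python
# Minimal sketch: assign a contiguous block of layers to each pipeline stage.
# `layers` and `num_stages` are illustrative placeholders, not a framework API.
def partition_layers(layers, num_stages):
    """Split a list of layers into `num_stages` contiguous stages."""
    per_stage, remainder = divmod(len(layers), num_stages)
    stages, start = [], 0
    for stage_idx in range(num_stages):
        # Early stages absorb the remainder so stage sizes differ by at most one layer.
        size = per_stage + (1 if stage_idx < remainder else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

layer_names = [f"layer_{i}" for i in range(1, 17)]   # a 16-layer model
for i, stage in enumerate(partition_layers(layer_names, 2), start=1):
    print(f"Stage {i}: {stage[0]} .. {stage[-1]}")    # Stage 1: layer_1 .. layer_8, etc.
```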
Micro-batching and Pipeline Execution
A naive implementation of PP would process a full data batch through Stage 1, pass the activations to Stage 2, process there, and continue sequentially. This leads to significant device idle time, as only one stage is active at any moment. To improve utilization, PP employs micro-batching. The full data batch is split into smaller micro-batches (m1, m2, ..., mk).
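A minimal sketch of that splitting step, assuming PyTorch tensors (the batch shape and micro-batch count are illustrative placeholders):

```python
import torch

# Minimal sketch of micro-batch splitting; batch size and micro-batch count
# are illustrative, not tied to any particular framework's configuration.
global_batch = torch.randn(32, 128, 1024)   # (batch, sequence, hidden)
num_micro_batches = 4

# torch.chunk splits along dim 0 into k roughly equal micro-batches m1..mk.
micro_batches = torch.chunk(global_batch, num_micro_batches, dim=0)
for i, mb in enumerate(micro_batches, start=1):
    print(f"m{i}: {tuple(mb.shape)}")       # each micro-batch is (8, 128, 1024)
```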
Execution proceeds in a staggered fashion:
- Stage 1 processes m1.
- Stage 1 processes m2 while Stage 2 processes the output of m1 from Stage 1.
- Stage 1 processes m3, Stage 2 processes m2's output, Stage 3 processes m1's output, etc.
This creates a "pipeline" effect, keeping multiple stages busy concurrently. The forward pass propagates micro-batches through the stages, and the backward pass propagates gradients back in reverse order.
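The staggered timing can be made concrete with a small forward-only simulation (pure Python; uniform stage times and the stage/micro-batch counts are assumptions):

```python
# Minimal sketch: simulate which micro-batch each stage works on at each
# forward "tick" of an idealized pipeline (uniform stage times, forward only).
num_stages = 3
num_micro_batches = 4

total_ticks = num_stages + num_micro_batches - 1
for tick in range(total_ticks):
    row = []
    for stage in range(num_stages):
        mb = tick - stage                      # micro-batch this stage holds at this tick
        if 0 <= mb < num_micro_batches:
            row.append(f"S{stage + 1}:F(m{mb + 1})")
        else:
            row.append(f"S{stage + 1}:idle")   # bubble at pipeline startup/teardown
    print(" | ".join(row))
```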
Visualization of pipeline parallelism with 3 stages and micro-batching. F denotes forward pass, B denotes backward pass, m denotes micro-batch index. Idle periods (bubbles) occur at the start and end of the schedule.
Integrating Pipeline Parallelism with MoE Layers
How does an MoE layer interact with pipeline stages? Typically, an entire MoE layer (gating network and associated experts) resides within a single pipeline stage. Splitting an MoE layer itself across stage boundaries is generally avoided due to the complex dependencies and communication involved in routing tokens to experts.
Therefore, a pipeline stage containing an MoE layer might look like this internally:
- Input activations arrive from the previous stage (or the initial embedding layer).
- These activations are processed by layers within the current stage, potentially including standard transformer blocks.
- If an MoE layer is present:
  - The gating network computes expert assignments for each token.
  - The All-to-All communication occurs within the device group assigned to this stage to route tokens to their designated experts (assuming EP is also used within the stage).
  - Experts process their assigned tokens.
  - Outputs are combined and potentially passed to subsequent layers within the same stage.
- The final activations of the stage are passed to the next pipeline stage.
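A schematic, single-process sketch of such a stage is shown below; `StageWithMoE`, its sizes, and the top-1 routing are illustrative, and the comments mark where the All-to-All exchanges would occur in a real expert-parallel setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic, single-process sketch of one pipeline stage that contains an MoE
# layer. StageWithMoE, the sizes, and top-1 routing are illustrative; in a
# real expert-parallel setup the marked steps would be All-to-All exchanges
# across the stage's device group rather than local indexing.
class StageWithMoE(nn.Module):
    def __init__(self, hidden=64, num_experts=4):
        super().__init__()
        self.dense_block = nn.Linear(hidden, hidden)   # stand-in for dense transformer blocks
        self.gate = nn.Linear(hidden, num_experts)     # gating network
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

    def forward(self, x):                               # x: (tokens, hidden), from the previous stage
        x = F.gelu(self.dense_block(x))                 # layers before the MoE layer
        scores = self.gate(x)                           # per-token expert scores
        probs = scores.softmax(dim=-1)
        top1 = scores.argmax(dim=-1)                    # top-1 routing for simplicity
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # In distributed EP, an All-to-All would send these tokens to the
                # rank owning expert e within this stage, and a second All-to-All
                # would return the expert outputs to their source ranks.
                out[mask] = probs[mask, e].unsqueeze(-1) * expert(x[mask])
        return out                                       # stage activations, passed to the next stage

stage = StageWithMoE()
activations_in = torch.randn(10, 64)        # arriving from the previous pipeline stage
activations_out = stage(activations_in)     # would be sent to the next pipeline stage
print(activations_out.shape)                # torch.Size([10, 64])
```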
Hybrid Parallelism Strategies: PP + DP + EP
For truly massive MoE models, pipeline parallelism is most effective when combined with Data Parallelism and Expert Parallelism. A common configuration is:
- Pipeline Parallelism (PP): The model's layers are divided into stages across different nodes or groups of GPUs. This addresses the activation memory bottleneck and allows scaling beyond single-node memory capacity.
- Data Parallelism (DP): Within each pipeline stage, the layers assigned to that stage are replicated across multiple GPUs. Each GPU processes a different micro-batch (or a fraction thereof). This scales computation within the stage.
- Expert Parallelism (EP): For any MoE layers within a pipeline stage, the experts are distributed across the data-parallel replicas of that stage. The All-to-All communication for token routing happens among the GPUs participating in that specific stage.
Hybrid parallelism combining PP, DP, and EP. Stages are formed via PP. Within each stage, DP replicates the stage logic, and EP distributes experts across the DP ranks. All-to-All for EP happens within a stage, while activations/gradients flow between stages for PP.
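One way to make this rank layout concrete is to enumerate which global ranks form each group (plain Python arithmetic only; the sizes are illustrative, and real groups would be created with the framework's process-group APIs, e.g. torch.distributed.new_group):

```python
# Minimal sketch of a rank layout for PP x DP (+ EP inside each stage).
# Sizes are illustrative: 2 pipeline stages, 4 data-parallel ranks per stage.
pp_size, dp_size = 2, 4
world_size = pp_size * dp_size          # 8 ranks total

# Ranks are laid out stage-major: stage 0 holds ranks 0..3, stage 1 holds 4..7.
stage_groups = [list(range(s * dp_size, (s + 1) * dp_size)) for s in range(pp_size)]

# Within each stage, the DP ranks replicate the stage's dense layers, and the
# MoE experts of that stage are sharded across those same ranks (EP), so the
# All-to-All for token routing runs inside each stage group.
for s, ranks in enumerate(stage_groups):
    print(f"Stage {s}: DP/EP ranks {ranks}")

# PP communication (activations forward, gradients backward) pairs ranks that
# hold the same DP slot in adjacent stages.
pp_pairs = [(r, r + dp_size) for r in stage_groups[0]]
print("PP send/recv pairs between Stage 0 and Stage 1:", pp_pairs)
```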
Considerations for PP in MoE Models
- Pipeline Bubbles: While micro-batching helps, the pipeline startup and teardown phases still introduce idle time (bubbles). Advanced schedules, such as the 1F1B (one-forward-one-backward) schedule and its interleaved variants, mitigate this further by overlapping forward and backward passes more effectively; a back-of-the-envelope bubble estimate appears after this list.
- Stage Balancing: Achieving optimal throughput requires balancing the computational load across pipeline stages. MoE layers are often considerably more expensive in wall-clock time than standard dense layers, owing to routing, All-to-All communication, and expert computation, so placing them requires care to avoid turning one stage into the bottleneck. The number of experts and the expert size influence stage duration; a toy balancing sketch follows after this list.
- Memory Balancing: Similarly, the memory requirements (parameters and activations) should be balanced across stages. PP is excellent at distributing activation memory, which scales with sequence length and batch size. MoE layers add parameter memory, distributed via EP within the stage.
- Communication: PP introduces communication overhead between stages for activations (forward) and gradients (backward). This is typically point-to-point or collective communication between adjacent stage device groups. This communication cost must be weighed against the All-to-All cost within stages containing MoE layers. Optimizing both is essential.
- Framework Complexity: Implementing hybrid parallelism strategies requires sophisticated distributed training frameworks like DeepSpeed, Megatron-LM, or Tutel, which provide abstractions to manage the complex interplay between PP, DP, and EP/TP (Tensor Parallelism).
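For the bubble point above, a common back-of-the-envelope estimate for a simple non-interleaved schedule with p equal-duration stages and m micro-batches puts the idle fraction at roughly (p - 1) / (m + p - 1); a tiny calculation:

```python
# Rough bubble-fraction estimate for a simple (non-interleaved) schedule with
# p equal-duration stages and m micro-batches: bubble ≈ (p - 1) / (m + p - 1).
def bubble_fraction(num_stages, num_micro_batches):
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

for m in (4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches -> bubble ≈ {bubble_fraction(4, m):.1%}")
# More micro-batches shrink the bubble, which is why schedules such as 1F1B keep
# many micro-batches in flight while bounding activation memory.
```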
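For stage balancing, one purely illustrative heuristic is to assign contiguous layers to stages so that each stage's estimated cost stays near an even share; the per-layer cost numbers below are made-up placeholders standing in for profiled layer times:

```python
# Illustrative greedy split of contiguous layers into stages by estimated cost.
# Cost numbers are made-up placeholders; a real system would profile them, and
# the heavier MoE layers here tend to dominate the stage they land in.
layer_costs = [1.0, 1.0, 3.0, 1.0, 1.0, 3.0, 1.0, 1.0]   # two MoE-heavy layers
num_stages = 2
target = sum(layer_costs) / num_stages

stages, current, current_cost = [], [], 0.0
for i, cost in enumerate(layer_costs):
    current.append(i)
    current_cost += cost
    stages_left = num_stages - len(stages) - 1
    layers_left = len(layer_costs) - (i + 1)
    # Close this stage once it reaches its share, as long as enough layers
    # remain to populate the stages that still need to be formed.
    if stages_left > 0 and current_cost >= target and layers_left >= stages_left:
        stages.append((current, current_cost))
        current, current_cost = [], 0.0
stages.append((current, current_cost))    # the last stage takes the remainder

for s, (layer_ids, cost) in enumerate(stages, start=1):
    print(f"Stage {s}: layers {layer_ids}, estimated cost {cost}")
```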
Pipeline parallelism, particularly when integrated into a hybrid strategy, provides a powerful mechanism for scaling MoE models beyond the constraints of single-node memory and computation, enabling the training of models with trillions of parameters by effectively partitioning the workload across layers and devices.