While Data Parallelism replicates the model and Tensor Parallelism splits individual operations within layers, Pipeline Parallelism (PP) takes a different approach to distributing the computational load. It partitions the entire model vertically, assigning consecutive layers to different devices, forming a processing pipeline much like an assembly line.
Imagine a large Transformer model with many layers. Instead of trying to fit all layers onto one device or splitting complex matrix multiplications within a layer, PP assigns, for example, layers 1-12 to GPU 0, layers 13-24 to GPU 1, layers 25-36 to GPU 2, and so on. Each group of layers executed on a single device is called a stage or partition.
In a pipeline parallel setup, a data batch is first broken down into smaller micro-batches. This is essential for keeping the pipeline stages utilized effectively, as we'll see shortly.
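As a minimal sketch (the batch size and micro-batch count here are arbitrary), splitting a batch into micro-batches amounts to chunking the input along the batch dimension:

import torch

# Hypothetical sizes: 32 samples with hidden size 512, split into 8 micro-batches
batch = torch.randn(32, 512)
num_micro_batches = 8
micro_batches = torch.chunk(batch, num_micro_batches, dim=0)
# -> 8 tensors of shape (4, 512), fed into the pipeline one after another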
The process works as follows:
1. The first micro-batch enters Stage 0 (GPU 0), which runs the forward pass through its layers.
2. Stage 0 sends the resulting activations to Stage 1 and immediately begins the forward pass for the second micro-batch.
3. Each subsequent stage receives activations from the previous stage, processes them, and forwards its outputs, so successive micro-batches flow through the pipeline.
4. The last stage computes the loss and starts the backward pass; gradients with respect to the activations then flow back through the stages in reverse order.
This flow allows different devices to work on different micro-batches simultaneously, parallelizing the computation across the model's depth.
A 4-stage pipeline showing forward (fwd) activation flow and backward (bwd) gradient flow across GPUs.
A significant challenge in pipeline parallelism is the pipeline bubble or idle time. At the beginning of processing a batch, only Stage 0 is active. Stage 1 must wait for Stage 0 to finish the first micro-batch, Stage 2 must wait for Stage 1, and so on. Similarly, during the backward pass, the initial stages become idle as they wait for gradients to arrive from later stages. This startup and wind-down period results in underutilization of the hardware.
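The short simulation below makes the bubble visible: it prints which micro-batch each stage works on at every forward-pass time step, assuming each stage takes one time unit per micro-batch (the 4 stages and 6 micro-batches are arbitrary choices for illustration).

S, M = 4, 6  # stages, micro-batches
for t in range(S + M - 1):
    slots = []
    for stage in range(S):
        mb = t - stage  # micro-batch this stage would handle at time t
        slots.append(f"mb{mb}" if 0 <= mb < M else "idle")
    print(f"t={t}: " + "  ".join(slots))
# The "idle" entries at the start (and their mirror image during the
# backward pass) are the pipeline bubble.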
The size of this bubble depends on the number of pipeline stages (S) and the number of micro-batches (M). If each micro-batch takes roughly the same time (t) to pass through one stage, a simple sequential forward-then-backward schedule needs approximately T ≈ (S + M − 1) × t of wall-clock time for the forward pass, and similarly for the backward pass. The useful work, summed over all stages, is S × M × t per pass, while the total device-time available is S × (S + M − 1) × t. The efficiency (the fraction of time devices are busy) is therefore SM / (S(S + M − 1)) = M / (M + S − 1), and the bubble fraction (idle time) is (S − 1) / (M + S − 1).
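Plugging in example numbers shows how the bubble shrinks as M grows relative to S:

S = 4  # pipeline stages
for M in (4, 16, 64):  # micro-batches
    efficiency = M / (M + S - 1)
    bubble = (S - 1) / (M + S - 1)
    print(f"M={M:3d}: efficiency={efficiency:.2f}, bubble={bubble:.2f}")
# M=4 -> ~57% busy, M=16 -> ~84% busy, M=64 -> ~96% busy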
To minimize the bubble, we need to increase the number of micro-batches (M) relative to the number of stages (S). However, increasing M means smaller micro-batches, which might not fully utilize the compute capabilities of each GPU, and it also increases the total activation memory required across all micro-batches in flight.
To mitigate the bubble, various scheduling strategies have been developed beyond the simple "all forward, then all backward" approach (often associated with GPipe). A common and effective strategy is 1F1B (one forward, one backward) scheduling, introduced by PipeDream and adopted, including in interleaved variants, by later frameworks.
In a 1F1B schedule, stages alternate between performing forward passes for upcoming micro-batches and backward passes for already completed micro-batches. Once a stage completes the forward pass for micro-batch i, it might immediately perform the backward pass for micro-batch i−k (where k is related to the number of stages), assuming the gradients are available from the next stage. This keeps the devices busier and significantly reduces the idle-time bubble compared to the naive schedule.
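A minimal sketch of the operation order a single stage follows under a (non-interleaved) 1F1B schedule is shown below; the function name and the ("fwd", i) / ("bwd", i) tuple representation are illustrative, not part of any particular framework's API.

def one_f_one_b_schedule(stage_id, num_stages, num_micro_batches):
    # Warmup: earlier stages run more forward passes before their
    # first backward pass becomes possible.
    num_warmup = min(num_stages - stage_id - 1, num_micro_batches)
    ops = [("fwd", i) for i in range(num_warmup)]
    # Steady state: alternate one forward with one backward.
    num_steady = num_micro_batches - num_warmup
    for i in range(num_steady):
        ops.append(("fwd", num_warmup + i))
        ops.append(("bwd", i))
    # Cooldown: drain the remaining backward passes.
    ops.extend(("bwd", i) for i in range(num_steady, num_micro_batches))
    return ops

# Example: stage 1 of a 4-stage pipeline, 8 micro-batches
print(one_f_one_b_schedule(stage_id=1, num_stages=4, num_micro_batches=8))

Because each stage starts running backward passes (and freeing the corresponding activations) much earlier than in the naive schedule, 1F1B also caps the number of in-flight activations per stage at roughly the number of stages rather than the number of micro-batches.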
Comparison of GPU utilization over time for Naive vs Interleaved 1F1B schedules with 3 stages and multiple micro-batches. Blue (Fwd) represents forward passes, Red (Bwd) represents backward passes, Gray (Idle) represents bubble time. 1F1B significantly reduces idle time.
Implementing pipeline parallelism effectively requires careful consideration of several factors:
- Partitioning and load balancing: stages should take roughly equal time per micro-batch, or the slowest stage becomes the bottleneck.
- Communication: activations must be sent forward and gradients sent backward between adjacent stages, typically with point-to-point operations.
- Memory: each stage must store (or recompute) the activations for every micro-batch in flight, in addition to its own layer parameters.
- Scheduling: the micro-batch schedule (naive, 1F1B, interleaved) determines both the bubble size and the peak activation memory.
Here's a highly simplified PyTorch snippet illustrating the idea of stages and passing data (actual implementations are much more involved):
import torch
import torch.nn as nn

# --- Assume these are defined elsewhere ---
# get_my_stage_id() -> int
# get_num_stages() -> int
# get_device_for_stage(stage_id) -> torch.device
# send_tensor(tensor, to_stage_id)
# recv_tensor(from_stage_id) -> tensor
# global_micro_batch_size = ...
# model_layers = [...]  # List of all model layers

class PipelineStage(nn.Module):
    def __init__(self, layers, stage_id):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.stage_id = stage_id
        self.device = get_device_for_stage(stage_id)
        self.to(self.device)

    def forward(self, x):
        # Simplified: assumes x was received from the previous stage
        # when stage_id > 0
        if x is not None:
            x = x.to(self.device)
        for layer in self.layers:
            x = layer(x)
        return x

# --- Partitioning the model (Example) ---
my_stage_id = get_my_stage_id()
num_stages = get_num_stages()

# Simplified balancing: equal layer counts per stage, remainder to the last stage
layers_per_stage = len(model_layers) // num_stages
start_layer = my_stage_id * layers_per_stage
if my_stage_id < num_stages - 1:
    end_layer = (my_stage_id + 1) * layers_per_stage
else:
    end_layer = len(model_layers)

my_layers = model_layers[start_layer:end_layer]
pipeline_module = PipelineStage(my_layers, my_stage_id)

# --- Simplified Training Step (No Scheduling Logic) ---
def training_step(micro_batch_data):
    if my_stage_id == 0:
        activations = micro_batch_data  # Input data for the first stage
    else:
        # Receive activations from the previous stage
        activations = recv_tensor(from_stage_id=my_stage_id - 1)
        # Track gradients so activations.grad can be sent back later
        activations.requires_grad_()

    # Forward pass through this stage's layers
    output_activations = pipeline_module(activations)

    loss = None
    if my_stage_id < num_stages - 1:
        # Send activations to the next stage
        send_tensor(output_activations, to_stage_id=my_stage_id + 1)
        # output_activations must be stored for the backward pass
        # when using schedules like 1F1B
    else:
        # Last stage computes the loss
        # (assuming target_labels are available)
        loss = compute_loss(output_activations, target_labels)
        # Start the backward pass
        loss.backward()
        # Send the gradient w.r.t. this stage's *input* back to the previous stage
        grad_to_send = activations.grad
        # send_tensor(grad_to_send, to_stage_id=my_stage_id - 1)

    # ... Backward pass logic continues for intermediate stages:
    # receive gradients from the next stage, run backward through this stage,
    # and send gradients for the input activations back to the previous stage ...
    return loss  # Or relevant metrics
Note: This code is purely illustrative. Real implementations require sophisticated scheduling logic (like 1F1B), handling activation checkpointing or recomputation, gradient accumulation across micro-batches, and robust communication primitives.
Advantages:
- Enables training models whose parameters do not fit in the memory of a single device, since each device stores only its own stage's layers.
- Communication is limited to point-to-point transfers of activations and gradients between adjacent stages, typically far less volume than the collective operations Tensor Parallelism requires.
- Inter-stage traffic is infrequent enough to tolerate slower interconnects than intra-layer parallelism.
Disadvantages:
- The pipeline bubble leaves devices idle during ramp-up and wind-down, reducing hardware utilization.
- Stages must be carefully balanced; the slowest stage limits the throughput of the whole pipeline.
- Activations for in-flight micro-batches must be stored or recomputed, increasing memory pressure.
- Implementation complexity is high: micro-batch scheduling, communication, and gradient accumulation must all be handled correctly.
Pipeline Parallelism is rarely used in isolation for large models. Instead, it's often combined with Data Parallelism and Tensor Parallelism in hybrid approaches. For example, a common setup involves using Data Parallelism across different multi-GPU nodes, while employing Pipeline and/or Tensor Parallelism within each node to manage the model size across the node's GPUs. This allows scaling both the batch size (via DP) and the model size (via PP/TP).
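As a rough sketch (the sizes and the rank ordering below are arbitrary assumptions, not any specific framework's convention), such a hybrid layout can be pictured as a 3D grid of ranks, with one axis per parallelism dimension:

# 16 GPUs arranged as 2-way DP x 4-way PP x 2-way TP (2 * 4 * 2 = 16)
dp_size, pp_size, tp_size = 2, 4, 2
for rank in range(dp_size * pp_size * tp_size):
    tp_rank = rank % tp_size                      # fastest-varying axis
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // (tp_size * pp_size)
    print(f"GPU {rank:2d} -> dp={dp_rank}, pp={pp_rank}, tp={tp_rank}")
# Real frameworks build one communication group per axis of this grid:
# DP groups for gradient all-reduce, PP groups for stage-to-stage
# send/recv, and TP groups for intra-layer collectives.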