While Data Parallelism replicates the model and Tensor Parallelism splits individual operations within layers, Pipeline Parallelism (PP) takes a different approach to distributing the computational load. It partitions the entire model vertically, assigning consecutive layers to different devices, forming a processing pipeline much like an assembly line.
Imagine a large Transformer model with many layers. Instead of trying to fit all layers onto one device or splitting complex matrix multiplications within a layer, PP assigns, for example, layers 1-12 to GPU 0, layers 13-24 to GPU 1, layers 25-36 to GPU 2, and so on. Each group of layers executed on a single device is called a stage or partition.
In a pipeline parallel setup, a data batch is first broken down into smaller micro-batches. This is essential for keeping the pipeline stages utilized effectively, as we'll see shortly.
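As a minimal sketch (the batch size and micro-batch count here are arbitrary), splitting a batch into micro-batches amounts to chunking the input along the batch dimension:

import torch

# Hypothetical sizes: 32 samples with hidden size 512, split into 8 micro-batches
batch = torch.randn(32, 512)
num_micro_batches = 8
micro_batches = torch.chunk(batch, num_micro_batches, dim=0)
# -> 8 tensors of shape (4, 512), fed into the pipeline one after another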
The process works as follows:
1. The first micro-batch enters Stage 0 (GPU 0), which runs the forward pass through its layers.
2. Stage 0 sends the resulting activations to Stage 1 and immediately begins the forward pass for the second micro-batch.
3. Each subsequent stage receives activations from the previous stage, processes them, and forwards its outputs, so successive micro-batches flow through the pipeline.
4. The last stage computes the loss and starts the backward pass; gradients with respect to the activations then flow back through the stages in reverse order.
This flow allows different devices to work on different micro-batches simultaneously, parallelizing the computation across the model's depth.
A 4-stage pipeline showing forward (fwd) activation flow and backward (bwd) gradient flow across GPUs.
A significant challenge in pipeline parallelism is the pipeline bubble or idle time. At the beginning of processing a batch, only Stage 0 is active. Stage 1 must wait for Stage 0 to finish the first micro-batch, Stage 2 must wait for Stage 1, and so on. Similarly, during the backward pass, the initial stages become idle as they wait for gradients to arrive from later stages. This startup and wind-down period results in underutilization of the hardware.
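The short simulation below makes the bubble visible: it prints which micro-batch each stage works on at every forward-pass time step, assuming each stage takes one time unit per micro-batch (the 4 stages and 6 micro-batches are arbitrary choices for illustration).

S, M = 4, 6  # stages, micro-batches
for t in range(S + M - 1):
    slots = []
    for stage in range(S):
        mb = t - stage  # micro-batch this stage would handle at time t
        slots.append(f"mb{mb}" if 0 <= mb < M else "idle")
    print(f"t={t}: " + "  ".join(slots))
# The "idle" entries at the start (and their mirror image during the
# backward pass) are the pipeline bubble.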
The size of this bubble depends on the number of pipeline stages (S) and the number of micro-batches (M). If each micro-batch takes roughly the same time (t) to pass through one stage, a simple sequential forward-then-backward schedule needs approximately T ≈ (S + M − 1) × t of wall-clock time for the forward pass, and similarly for the backward pass. The useful work, summed over all stages, is S × M × t per pass, while the total device-time available is S × (S + M − 1) × t. The efficiency (the fraction of time devices are busy) is therefore SM / (S(S + M − 1)) = M / (M + S − 1), and the bubble fraction (idle time) is (S − 1) / (M + S − 1).
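Plugging in example numbers shows how the bubble shrinks as M grows relative to S:

S = 4  # pipeline stages
for M in (4, 16, 64):  # micro-batches
    efficiency = M / (M + S - 1)
    bubble = (S - 1) / (M + S - 1)
    print(f"M={M:3d}: efficiency={efficiency:.2f}, bubble={bubble:.2f}")
# M=4 -> ~57% busy, M=16 -> ~84% busy, M=64 -> ~96% busy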
To minimize the bubble, we need to increase the number of micro-batches (M) relative to the number of stages (S). However, increasing M means smaller micro-batches, which might not fully utilize the compute capabilities of each GPU, and it also increases the total activation memory required across all micro-batches in flight.
To mitigate the bubble, various scheduling strategies have been developed beyond the simple "all forward, then all backward" approach (often associated with GPipe). A common and effective strategy is 1F1B (one forward, one backward) scheduling, introduced by PipeDream and adopted, including in interleaved variants, by later frameworks.
In a 1F1B schedule, stages alternate between performing forward passes for upcoming micro-batches and backward passes for already completed micro-batches. Once a stage completes the forward pass for micro-batch i, it might immediately perform the backward pass for micro-batch i−k (where k is related to the number of stages), assuming the gradients are available from the next stage. This keeps the devices busier and significantly reduces the idle-time bubble compared to the naive schedule.
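A minimal sketch of the operation order a single stage follows under a (non-interleaved) 1F1B schedule is shown below; the function name and the ("fwd", i) / ("bwd", i) tuple representation are illustrative, not part of any particular framework's API.

def one_f_one_b_schedule(stage_id, num_stages, num_micro_batches):
    # Warmup: earlier stages run more forward passes before their
    # first backward pass becomes possible.
    num_warmup = min(num_stages - stage_id - 1, num_micro_batches)
    ops = [("fwd", i) for i in range(num_warmup)]
    # Steady state: alternate one forward with one backward.
    num_steady = num_micro_batches - num_warmup
    for i in range(num_steady):
        ops.append(("fwd", num_warmup + i))
        ops.append(("bwd", i))
    # Cooldown: drain the remaining backward passes.
    ops.extend(("bwd", i) for i in range(num_steady, num_micro_batches))
    return ops

# Example: stage 1 of a 4-stage pipeline, 8 micro-batches
print(one_f_one_b_schedule(stage_id=1, num_stages=4, num_micro_batches=8))

Because each stage starts running backward passes (and freeing the corresponding activations) much earlier than in the naive schedule, 1F1B also caps the number of in-flight activations per stage at roughly the number of stages rather than the number of micro-batches.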
Comparison of GPU utilization over time for Naive vs Interleaved 1F1B schedules with 3 stages and multiple micro-batches. Blue (Fwd) represents forward passes, Red (Bwd) represents backward passes, Gray (Idle) represents bubble time. 1F1B significantly reduces idle time.
Implementing pipeline parallelism effectively requires careful consideration of several factors:
- Partitioning and load balancing: stages should take roughly equal time per micro-batch, or the slowest stage becomes the bottleneck.
- Communication: activations must be sent forward and gradients sent backward between adjacent stages, typically with point-to-point operations.
- Memory: each stage must store (or recompute) the activations for every micro-batch in flight, in addition to its own layer parameters.
- Scheduling: the micro-batch schedule (naive, 1F1B, interleaved) determines both the bubble size and the peak activation memory.
Here's a highly simplified PyTorch snippet illustrating the idea of stages and passing data (actual implementations are much more involved):
import torch
import torch.nn as nn

# --- Assume these are defined elsewhere ---
# get_my_stage_id() -> int
# get_num_stages() -> int
# get_device_for_stage(stage_id) -> torch.device
# send_tensor(tensor, to_stage_id)
# recv_tensor(from_stage_id) -> tensor
# global_micro_batch_size = ...
# model_layers = [...]  # List of all model layers

class PipelineStage(nn.Module):
    def __init__(self, layers, stage_id):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.stage_id = stage_id
        self.device = get_device_for_stage(stage_id)
        self.to(self.device)

    def forward(self, x):
        # Simplified: assumes x was received from the previous stage
        # when stage_id > 0
        if x is not None:
            x = x.to(self.device)
        for layer in self.layers:
            x = layer(x)
        return x

# --- Partitioning the model (Example) ---
my_stage_id = get_my_stage_id()
num_stages = get_num_stages()

# Simplified balancing: equal layer counts per stage, remainder to the last stage
layers_per_stage = len(model_layers) // num_stages
start_layer = my_stage_id * layers_per_stage
if my_stage_id < num_stages - 1:
    end_layer = (my_stage_id + 1) * layers_per_stage
else:
    end_layer = len(model_layers)

my_layers = model_layers[start_layer:end_layer]
pipeline_module = PipelineStage(my_layers, my_stage_id)

# --- Simplified Training Step (No Scheduling Logic) ---
def training_step(micro_batch_data):
    if my_stage_id == 0:
        activations = micro_batch_data  # Input data for the first stage
    else:
        # Receive activations from the previous stage
        activations = recv_tensor(from_stage_id=my_stage_id - 1)
        # Track gradients so activations.grad can be sent back later
        activations.requires_grad_()

    # Forward pass through this stage's layers
    output_activations = pipeline_module(activations)

    loss = None
    if my_stage_id < num_stages - 1:
        # Send activations to the next stage
        send_tensor(output_activations, to_stage_id=my_stage_id + 1)
        # output_activations must be stored for the backward pass
        # when using schedules like 1F1B
    else:
        # Last stage computes the loss
        # (assuming target_labels are available)
        loss = compute_loss(output_activations, target_labels)
        # Start the backward pass
        loss.backward()
        # Send the gradient w.r.t. this stage's *input* back to the previous stage
        grad_to_send = activations.grad
        # send_tensor(grad_to_send, to_stage_id=my_stage_id - 1)

    # ... Backward pass logic continues for intermediate stages:
    # receive gradients from the next stage, run backward through this stage,
    # and send gradients for the input activations back to the previous stage ...
    return loss  # Or relevant metrics
Note: This code is purely illustrative. Real implementations require sophisticated scheduling logic (like 1F1B), handling activation checkpointing or recomputation, gradient accumulation across micro-batches, and robust communication primitives.
Advantages:
- Enables training models whose parameters do not fit in the memory of a single device, since each device stores only its own stage's layers.
- Communication is limited to point-to-point transfers of activations and gradients between adjacent stages, typically far less volume than the collective operations Tensor Parallelism requires.
- Inter-stage traffic is infrequent enough to tolerate slower interconnects than intra-layer parallelism.
Disadvantages:
- The pipeline bubble leaves devices idle during ramp-up and wind-down, reducing hardware utilization.
- Stages must be carefully balanced; the slowest stage limits the throughput of the whole pipeline.
- Activations for in-flight micro-batches must be stored or recomputed, increasing memory pressure.
- Implementation complexity is high: micro-batch scheduling, communication, and gradient accumulation must all be handled correctly.
Pipeline Parallelism is rarely used in isolation for large models. Instead, it's often combined with Data Parallelism and Tensor Parallelism in hybrid approaches. For example, a common setup involves using Data Parallelism across different multi-GPU nodes, while employing Pipeline and/or Tensor Parallelism within each node to manage the model size across the node's GPUs. This allows scaling both the batch size (via DP) and the model size (via PP/TP).
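As a rough sketch (the sizes and the rank ordering below are arbitrary assumptions, not any specific framework's convention), such a hybrid layout can be pictured as a 3D grid of ranks, with one axis per parallelism dimension:

# 16 GPUs arranged as 2-way DP x 4-way PP x 2-way TP (2 * 4 * 2 = 16)
dp_size, pp_size, tp_size = 2, 4, 2
for rank in range(dp_size * pp_size * tp_size):
    tp_rank = rank % tp_size                      # fastest-varying axis
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // (tp_size * pp_size)
    print(f"GPU {rank:2d} -> dp={dp_rank}, pp={pp_rank}, tp={tp_rank}")
# Real frameworks build one communication group per axis of this grid:
# DP groups for gradient all-reduce, PP groups for stage-to-stage
# send/recv, and TP groups for intra-layer collectives.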