While data parallelism effectively scales training by distributing data, it relies on a fundamental assumption: the model itself fits into the memory of a single accelerator. For state-of-the-art models, especially in natural language processing and computer vision, this assumption no longer holds. When a model's parameters, gradients, and optimizer states exceed the VRAM of a single GPU, a different approach is required: partitioning the model itself across multiple devices.
The most direct way to handle a single, massive layer that won't fit on one GPU is to split the layer's computation, a technique often called tensor parallelism. Consider a large linear layer, defined by the equation Y=XA, where X is the input activation and A is the weight matrix. If matrix A is too large for one device, we can partition it across multiple GPUs.
For example, splitting a linear layer's weight matrix A column-wise across two GPUs (A=[A1,A2]) allows us to compute parts of the output in parallel.
Y = X[A1, A2] = [XA1, XA2]

GPU 1 computes Y1 = XA1 and GPU 2 computes Y2 = XA2. The input X is broadcast to both GPUs. After the parallel computation, the results Y1 and Y2 are gathered to form the final output Y. This process requires significant communication: the input must be sent to all devices, and the partial results must be combined.
A diagram of tensor parallelism where a single layer's weight matrix is split across two GPUs. The input is broadcast, and partial results are combined via an All-Gather operation.
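The column-wise split is easy to verify numerically. The following is a minimal single-process sketch in PyTorch: it splits a weight matrix into two column shards, computes the partial outputs from the same input, and checks that concatenating them reproduces the unsharded result. The tensor sizes and the two-way split are illustrative assumptions; in a real setup each shard would live on a different GPU.

```python
import torch

torch.manual_seed(0)

batch, d_in, d_out = 4, 8, 6        # illustrative sizes
X = torch.randn(batch, d_in)        # input activation, broadcast to both shards
A = torch.randn(d_in, d_out)        # full weight matrix of the linear layer

# Column-wise split: A = [A1, A2]. Here both shards stay on one device;
# in real tensor parallelism each would be owned by a different GPU.
A1, A2 = A.chunk(2, dim=1)

# Each GPU computes its partial output from the same (broadcast) input.
Y1 = X @ A1                         # GPU 1: shape (batch, d_out // 2)
Y2 = X @ A2                         # GPU 2: shape (batch, d_out // 2)

# Gather step: concatenate the partial outputs along the column dimension.
Y_parallel = torch.cat([Y1, Y2], dim=1)

# The sharded computation matches the ordinary full matmul Y = XA.
assert torch.allclose(Y_parallel, X @ A, atol=1e-6)
print("column-parallel output matches:", Y_parallel.shape)
```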
The backward pass follows a similar pattern in reverse. The gradient with respect to the output, ∇Y, is split column-wise so that each GPU receives the slice corresponding to its shard. Each GPU then computes its local weight gradient without any communication, while the partial gradients with respect to the input are summed across GPUs with an All-Reduce operation before being passed to the previous layer.
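The same toy setup can be used to check the backward step. Each shard produces a full-shaped partial gradient with respect to X, and summing the partial gradients, which is the job of the All-Reduce, recovers the gradient of the unsharded layer. As before, this is a single-process sketch with made-up sizes rather than a distributed implementation.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8, requires_grad=True)
A = torch.randn(8, 6)
A1, A2 = A.chunk(2, dim=1)          # column shards, as in the forward sketch

# Forward on both shards, then a scalar loss over the gathered output.
Y = torch.cat([X @ A1, X @ A2], dim=1)
Y.sum().backward()
grad_reference = X.grad.clone()     # autograd's gradient for the unsharded layer

# Manually replay what each GPU would do locally.
grad_Y = torch.ones(4, 6)           # dL/dY for loss = Y.sum()
gY1, gY2 = grad_Y.chunk(2, dim=1)   # split of the output gradient per shard
partial_1 = gY1 @ A1.T              # GPU 1's partial dL/dX (full shape)
partial_2 = gY2 @ A2.T              # GPU 2's partial dL/dX (full shape)

# All-Reduce step: summing the partial input gradients gives the full gradient.
assert torch.allclose(partial_1 + partial_2, grad_reference, atol=1e-6)
```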
This fine-grained parallelism introduces high-frequency communication, making it highly sensitive to interconnect bandwidth. It is most effective inside a multi-GPU server with high-speed links like NVLink.
For very deep models, another strategy is to partition the model between layers rather than within them. This is known as pipeline parallelism. In this approach, contiguous blocks of layers, called stages, are placed on different accelerators.
For instance, in a 32-layer model running on 4 GPUs, GPU 0 holds layers 1-8, GPU 1 holds layers 9-16, GPU 2 holds layers 17-24, and GPU 3 holds layers 25-32.
A data batch is fed into GPU 0. Once GPU 0 completes its forward pass for layers 1-8, it passes the resulting activations to GPU 1, which then begins its computation. This process continues sequentially down the line.
A sequential flow in naive pipeline parallelism. Each GPU stage waits for the previous one to finish, leading to significant idle time.
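A naive version of this split is straightforward to express in PyTorch. The sketch below partitions an illustrative 32-layer stack into four contiguous stages, places each on its own device, and runs the forward pass sequentially, handing activations from one GPU to the next. The layer sizes are made up, and it assumes four GPUs (cuda:0 through cuda:3) are available.

```python
import torch
import torch.nn as nn

# An illustrative 32-layer model: 32 identical linear blocks.
hidden = 256
layers = [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(32)]

# Partition into 4 contiguous stages of 8 layers and place each on its own GPU.
num_stages = 4
per_stage = len(layers) // num_stages              # 8 layers per stage
stages = [
    nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(f"cuda:{i}")
    for i in range(num_stages)
]

def naive_pipeline_forward(x: torch.Tensor) -> torch.Tensor:
    """Naive pipeline: each stage waits for the previous one to finish."""
    for i, stage in enumerate(stages):
        x = x.to(f"cuda:{i}")       # send activations to this stage's GPU
        x = stage(x)                # GPU i runs its block of 8 layers
    return x                        # output lives on the last GPU

out = naive_pipeline_forward(torch.randn(16, hidden))
print(out.device, out.shape)        # cuda:3, torch.Size([16, 256])
```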
The simple sequential approach shown above is inefficient. While GPU 0 processes a batch, GPUs 1, 2, and 3 are completely idle. This idle time, known as the "pipeline bubble," severely harms overall hardware utilization.
To mitigate this, the input data batch is split into smaller micro-batches. As soon as GPU 0 finishes processing the first micro-batch and passes it to GPU 1, it can immediately start processing the second micro-batch. This allows the GPUs to work in parallel on different micro-batches, creating a true pipeline.
GPU utilization over time for a 4-stage pipeline with 4 micro-batches. "Fwd" denotes the forward pass and "Bwd" the backward pass. The initial startup and final flush periods (the "bubble") are visible as gray spaces, but the central period shows high, overlapping utilization.
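The micro-batch schedule in the figure can be written down directly. The sketch below reuses the four-stage split from the previous example: at clock step t, stage s works on micro-batch t − s, which is exactly the staircase pattern of the diagram. It only covers the forward direction and runs the steps in a plain Python loop, so it illustrates the schedule rather than showing how a library such as DeepSpeed actually overlaps work across GPUs.

```python
import torch

def pipelined_forward(stages, batch, num_microbatches=4):
    """Micro-batched pipeline forward: at clock step t, stage s processes
    micro-batch (t - s). Real implementations run the stages on different
    GPUs concurrently; this loop only makes the schedule explicit."""
    micro_batches = list(batch.chunk(num_microbatches, dim=0))
    num_stages = len(stages)
    outputs = [None] * num_microbatches

    # One clock step per column of the utilization diagram above.
    for t in range(num_microbatches + num_stages - 1):
        for s, stage in enumerate(stages):
            m = t - s                                  # micro-batch index for stage s
            if 0 <= m < num_microbatches:
                x = micro_batches[m].to(f"cuda:{s}")   # activations from stage s - 1
                micro_batches[m] = stage(x)
                if s == num_stages - 1:
                    outputs[m] = micro_batches[m]      # final stage emits the output
    return torch.cat(outputs, dim=0)

# Usage with the `stages` and `hidden` defined in the previous sketch:
# out = pipelined_forward(stages, torch.randn(16, hidden), num_microbatches=4)
```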
This schedule, often called GPipe, improves utilization but is not perfect. The bubble still exists at the start (the ramp-up phase) and end (the ramp-down phase) of processing a full batch. More advanced schedules, like the one used in DeepSpeed, interleave the forward and backward passes of different micro-batches to fill more of the bubble.
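To put a rough number on the bubble, a common back-of-the-envelope estimate treats each micro-batch's pass through one stage as a single time unit: with p stages and m micro-batches, p − 1 of the m + p − 1 pipeline steps are ramp-up or ramp-down, so the idle fraction is (p − 1) / (m + p − 1). The short calculation below applies this to the 4-stage, 4-micro-batch example from the figure; the uniform-step-time assumption is a simplification.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of total pipeline time spent idle, assuming every stage
    takes the same time per micro-batch (a simplifying assumption)."""
    total_steps = num_microbatches + num_stages - 1   # steady state + ramp-up/down
    idle_steps = num_stages - 1                       # the pipeline bubble
    return idle_steps / total_steps

print(bubble_fraction(4, 4))    # 3/7 ~ 0.43: nearly half the time is bubble
print(bubble_fraction(4, 32))   # ~0.09: more micro-batches shrink the bubble
```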
Model and pipeline parallelism are not mutually exclusive; they solve different aspects of the same problem and are frequently used together. The choice depends on model architecture and hardware constraints.
Model Parallelism
Splits individual layers (their weight matrices) across devices, as described above for tensor parallelism. Frequent collective communication (such as an all-gather or all-reduce) is required within a forward/backward pass of a single layer. It is best suited for tightly coupled GPUs within a single node (e.g., over NVLink).
Pipeline Parallelism
Splits the model between layers into sequential stages. Communication happens only at stage boundaries, where activations and gradients are handed from one stage to the next, so it tolerates slower interconnects and scales well across nodes, provided enough micro-batches are used to keep the pipeline bubble small.
In many production scenarios, a hybrid approach is the most effective solution. For example, a large model might be split into several stages using pipeline parallelism across different nodes. Within each node, the GPUs assigned to a single stage might use tensor parallelism to manage the memory of that stage's layers and data parallelism to process micro-batches faster. This combination allows for scaling to massive model sizes across large clusters of accelerators.
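The bookkeeping for such a hybrid layout mostly comes down to how global ranks are grouped along the three axes. The sketch below shows one hypothetical mapping for a 32-GPU cluster of 4 nodes with 8 GPUs each: one pipeline stage per node, and within each node a tensor-parallel degree of 4 combined with 2 data-parallel replicas. The constants and the axis ordering are illustrative choices, not the layout that any particular framework mandates.

```python
# Hypothetical hybrid layout: 4 pipeline stages (one per 8-GPU node), and
# within each node tensor parallelism of degree 4 plus 2 data-parallel replicas.
PIPELINE_PARALLEL = 4
TENSOR_PARALLEL = 4
DATA_PARALLEL = 2
WORLD_SIZE = PIPELINE_PARALLEL * TENSOR_PARALLEL * DATA_PARALLEL   # 32 GPUs

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (pipeline stage, data-parallel replica, tensor shard).
    Tensor-parallel ranks are adjacent so each tensor-parallel group sits on
    the same node, where the fast NVLink interconnect lives."""
    tensor_rank = rank % TENSOR_PARALLEL
    data_rank = (rank // TENSOR_PARALLEL) % DATA_PARALLEL
    pipeline_rank = rank // (TENSOR_PARALLEL * DATA_PARALLEL)
    return pipeline_rank, data_rank, tensor_rank

for r in (0, 3, 4, 8, 31):   # a few representative ranks
    stage, replica, shard = rank_to_coords(r)
    print(f"rank {r:2d} -> stage {stage}, replica {replica}, tensor shard {shard}")
```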