While data parallelism effectively scales training by distributing data, it relies on a fundamental assumption: the model itself fits into the memory of a single accelerator. For state-of-the-art models, especially in natural language processing and computer vision, this assumption no longer holds. When a model's parameters, gradients, and optimizer states exceed the VRAM of a single GPU, a different approach is required: partitioning the model itself across multiple devices.
The most direct way to handle a single, massive layer that won't fit on one GPU is to split the layer's computation, a technique often called tensor parallelism. Consider a large linear layer, defined by the equation Y=XA, where X is the input activation and A is the weight matrix. If matrix A is too large for one device, we can partition it across multiple GPUs.
For example, splitting a linear layer's weight matrix A column-wise across two GPUs (A=[A1,A2]) allows us to compute parts of the output in parallel.
Y = X[A1, A2] = [XA1, XA2]

GPU 1 computes Y1 = XA1 and GPU 2 computes Y2 = XA2. The input X is broadcast to both GPUs. After the parallel computation, the results Y1 and Y2 are gathered to form the final output Y. This process requires significant communication: the input must be sent to all devices, and the partial results must be combined.
A diagram of tensor parallelism where a single layer's weight matrix is split across two GPUs. The input is broadcast, and partial results are combined via an All-Gather operation.
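The column-wise split is easy to verify numerically. The following is a minimal single-process sketch in PyTorch: it splits a weight matrix into two column shards, computes the partial outputs from the same input, and checks that concatenating them reproduces the unsharded result. The tensor sizes and the two-way split are illustrative assumptions; in a real setup each shard would live on a different GPU.

```python
import torch

torch.manual_seed(0)

batch, d_in, d_out = 4, 8, 6        # illustrative sizes
X = torch.randn(batch, d_in)        # input activation, broadcast to both shards
A = torch.randn(d_in, d_out)        # full weight matrix of the linear layer

# Column-wise split: A = [A1, A2]. Here both shards stay on one device;
# in real tensor parallelism each would be owned by a different GPU.
A1, A2 = A.chunk(2, dim=1)

# Each GPU computes its partial output from the same (broadcast) input.
Y1 = X @ A1                         # GPU 1: shape (batch, d_out // 2)
Y2 = X @ A2                         # GPU 2: shape (batch, d_out // 2)

# Gather step: concatenate the partial outputs along the column dimension.
Y_parallel = torch.cat([Y1, Y2], dim=1)

# The sharded computation matches the ordinary full matmul Y = XA.
assert torch.allclose(Y_parallel, X @ A, atol=1e-6)
print("column-parallel output matches:", Y_parallel.shape)
```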
The backward pass follows a similar pattern in reverse. The gradient with respect to the output, ∇Y, is split column-wise so that each GPU receives the slice corresponding to its shard. Each GPU then computes its local weight gradient without any communication, while the partial gradients with respect to the input are summed across GPUs with an All-Reduce operation before being passed to the previous layer.
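The same toy setup can be used to check the backward step. Each shard produces a full-shaped partial gradient with respect to X, and summing the partial gradients, which is the job of the All-Reduce, recovers the gradient of the unsharded layer. As before, this is a single-process sketch with made-up sizes rather than a distributed implementation.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8, requires_grad=True)
A = torch.randn(8, 6)
A1, A2 = A.chunk(2, dim=1)          # column shards, as in the forward sketch

# Forward on both shards, then a scalar loss over the gathered output.
Y = torch.cat([X @ A1, X @ A2], dim=1)
Y.sum().backward()
grad_reference = X.grad.clone()     # autograd's gradient for the unsharded layer

# Manually replay what each GPU would do locally.
grad_Y = torch.ones(4, 6)           # dL/dY for loss = Y.sum()
gY1, gY2 = grad_Y.chunk(2, dim=1)   # split of the output gradient per shard
partial_1 = gY1 @ A1.T              # GPU 1's partial dL/dX (full shape)
partial_2 = gY2 @ A2.T              # GPU 2's partial dL/dX (full shape)

# All-Reduce step: summing the partial input gradients gives the full gradient.
assert torch.allclose(partial_1 + partial_2, grad_reference, atol=1e-6)
```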
This fine-grained parallelism introduces high-frequency communication, making it highly sensitive to interconnect bandwidth. It is most effective inside a multi-GPU server with high-speed links like NVLink.
For very deep models, another strategy is to partition the model between layers rather than within them. This is known as pipeline parallelism. In this approach, contiguous blocks of layers, called stages, are placed on different accelerators.
For instance, in a 32-layer model running on 4 GPUs, GPU 0 holds layers 1-8, GPU 1 holds layers 9-16, GPU 2 holds layers 17-24, and GPU 3 holds layers 25-32.
A data batch is fed into GPU 0. Once GPU 0 completes its forward pass for layers 1-8, it passes the resulting activations to GPU 1, which then begins its computation. This process continues sequentially down the line.
A sequential flow in naive pipeline parallelism. Each GPU stage waits for the previous one to finish, leading to significant idle time.
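A naive version of this split is straightforward to express in PyTorch. The sketch below partitions an illustrative 32-layer stack into four contiguous stages, places each on its own device, and runs the forward pass sequentially, handing activations from one GPU to the next. The layer sizes are made up, and it assumes four GPUs (cuda:0 through cuda:3) are available.

```python
import torch
import torch.nn as nn

# An illustrative 32-layer model: 32 identical linear blocks.
hidden = 256
layers = [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(32)]

# Partition into 4 contiguous stages of 8 layers and place each on its own GPU.
num_stages = 4
per_stage = len(layers) // num_stages              # 8 layers per stage
stages = [
    nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(f"cuda:{i}")
    for i in range(num_stages)
]

def naive_pipeline_forward(x: torch.Tensor) -> torch.Tensor:
    """Naive pipeline: each stage waits for the previous one to finish."""
    for i, stage in enumerate(stages):
        x = x.to(f"cuda:{i}")       # send activations to this stage's GPU
        x = stage(x)                # GPU i runs its block of 8 layers
    return x                        # output lives on the last GPU

out = naive_pipeline_forward(torch.randn(16, hidden))
print(out.device, out.shape)        # cuda:3, torch.Size([16, 256])
```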
The simple sequential approach shown above is inefficient. While GPU 0 processes a batch, GPUs 1, 2, and 3 are completely idle. This idle time, known as the "pipeline bubble," severely harms overall hardware utilization.
To mitigate this, the input data batch is split into smaller micro-batches. As soon as GPU 0 finishes processing the first micro-batch and passes it to GPU 1, it can immediately start processing the second micro-batch. This allows the GPUs to work in parallel on different micro-batches, creating a true pipeline.
GPU utilization over time for a 4-stage pipeline with 4 micro-batches. "Fwd" denotes the forward pass and "Bwd" the backward pass. The initial startup and final flush periods (the "bubble") are visible as gray spaces, but the central period shows high, overlapping utilization.
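The micro-batch schedule in the figure can be written down directly. The sketch below reuses the four-stage split from the previous example: at clock step t, stage s works on micro-batch t − s, which is exactly the staircase pattern of the diagram. It only covers the forward direction and runs the steps in a plain Python loop, so it illustrates the schedule rather than showing how a library such as DeepSpeed actually overlaps work across GPUs.

```python
import torch

def pipelined_forward(stages, batch, num_microbatches=4):
    """Micro-batched pipeline forward: at clock step t, stage s processes
    micro-batch (t - s). Real implementations run the stages on different
    GPUs concurrently; this loop only makes the schedule explicit."""
    micro_batches = list(batch.chunk(num_microbatches, dim=0))
    num_stages = len(stages)
    outputs = [None] * num_microbatches

    # One clock step per column of the utilization diagram above.
    for t in range(num_microbatches + num_stages - 1):
        for s, stage in enumerate(stages):
            m = t - s                                  # micro-batch index for stage s
            if 0 <= m < num_microbatches:
                x = micro_batches[m].to(f"cuda:{s}")   # activations from stage s - 1
                micro_batches[m] = stage(x)
                if s == num_stages - 1:
                    outputs[m] = micro_batches[m]      # final stage emits the output
    return torch.cat(outputs, dim=0)

# Usage with the `stages` and `hidden` defined in the previous sketch:
# out = pipelined_forward(stages, torch.randn(16, hidden), num_microbatches=4)
```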
This schedule, often called GPipe, improves utilization but is not perfect. The bubble still exists at the start (the ramp-up phase) and end (the ramp-down phase) of processing a full batch. More advanced schedules, like the one used in DeepSpeed, interleave the forward and backward passes of different micro-batches to fill more of the bubble.
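To put a rough number on the bubble, a common back-of-the-envelope estimate treats each micro-batch's pass through one stage as a single time unit: with p stages and m micro-batches, p − 1 of the m + p − 1 pipeline steps are ramp-up or ramp-down, so the idle fraction is (p − 1) / (m + p − 1). The short calculation below applies this to the 4-stage, 4-micro-batch example from the figure; the uniform-step-time assumption is a simplification.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of total pipeline time spent idle, assuming every stage
    takes the same time per micro-batch (a simplifying assumption)."""
    total_steps = num_microbatches + num_stages - 1   # steady state + ramp-up/down
    idle_steps = num_stages - 1                       # the pipeline bubble
    return idle_steps / total_steps

print(bubble_fraction(4, 4))    # 3/7 ~ 0.43: nearly half the time is bubble
print(bubble_fraction(4, 32))   # ~0.09: more micro-batches shrink the bubble
```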
Model and pipeline parallelism are not mutually exclusive; they solve different aspects of the same problem and are frequently used together. The choice depends on model architecture and hardware constraints.
Model Parallelism
Splits individual layers (their weight matrices) across devices, as described above for tensor parallelism. Frequent collective communication (such as an all-gather or all-reduce) is required within a forward/backward pass of a single layer. It is best suited for tightly coupled GPUs within a single node (e.g., over NVLink).
Pipeline Parallelism
Splits the model between layers into sequential stages. Communication happens only at stage boundaries, where activations and gradients are handed from one stage to the next, so it tolerates slower interconnects and scales well across nodes, provided enough micro-batches are used to keep the pipeline bubble small.
In many production scenarios, a hybrid approach is the most effective solution. For example, a large model might be split into several stages using pipeline parallelism across different nodes. Within each node, the GPUs assigned to a single stage might use tensor parallelism to manage the memory of that stage's layers and data parallelism to process micro-batches faster. This combination allows for scaling to massive model sizes across large clusters of accelerators.
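The bookkeeping for such a hybrid layout mostly comes down to how global ranks are grouped along the three axes. The sketch below shows one hypothetical mapping for a 32-GPU cluster of 4 nodes with 8 GPUs each: one pipeline stage per node, and within each node a tensor-parallel degree of 4 combined with 2 data-parallel replicas. The constants and the axis ordering are illustrative choices, not the layout that any particular framework mandates.

```python
# Hypothetical hybrid layout: 4 pipeline stages (one per 8-GPU node), and
# within each node tensor parallelism of degree 4 plus 2 data-parallel replicas.
PIPELINE_PARALLEL = 4
TENSOR_PARALLEL = 4
DATA_PARALLEL = 2
WORLD_SIZE = PIPELINE_PARALLEL * TENSOR_PARALLEL * DATA_PARALLEL   # 32 GPUs

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (pipeline stage, data-parallel replica, tensor shard).
    Tensor-parallel ranks are adjacent so each tensor-parallel group sits on
    the same node, where the fast NVLink interconnect lives."""
    tensor_rank = rank % TENSOR_PARALLEL
    data_rank = (rank // TENSOR_PARALLEL) % DATA_PARALLEL
    pipeline_rank = rank // (TENSOR_PARALLEL * DATA_PARALLEL)
    return pipeline_rank, data_rank, tensor_rank

for r in (0, 3, 4, 8, 31):   # a few representative ranks
    stage, replica, shard = rank_to_coords(r)
    print(f"rank {r:2d} -> stage {stage}, replica {replica}, tensor shard {shard}")
```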