When a large language model's parameters and intermediate activations become too large to fit into the memory of a single accelerator (like a GPU or TPU), data parallelism alone isn't sufficient. While data parallelism replicates the model on multiple devices and processes different data batches on each, the entire model must still reside on each device. We need strategies to split the model itself across devices. This is the domain of model parallelism.
Model parallelism partitions the layers or tensors of a model across multiple accelerators. Instead of replicating the model, different parts of the model reside on different devices, allowing the collective memory of the cluster to hold models that are orders of magnitude larger than what a single device can handle. There are two primary strategies for implementing model parallelism: pipeline parallelism and tensor parallelism.
Pipeline parallelism involves partitioning the model's layers sequentially across multiple devices. Think of it like an assembly line: each device (or group of devices) forms a "stage" responsible for executing a specific subset of the model's layers.
Imagine a model with 12 layers distributed across 3 GPUs: GPU 0 holds layers 1-4, GPU 1 holds layers 5-8, and GPU 2 holds layers 9-12.
The input data flows through these stages sequentially: GPU 0 runs its layers on a batch and sends the resulting activations to GPU 1, which processes them and forwards its output to GPU 2. For the backward pass, gradients flow in the reverse direction.
A simplified view of pipeline parallelism, showing sequential processing across GPUs.
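To make the layer-to-stage mapping concrete, here is a minimal PyTorch sketch (not tied to any particular framework) that splits 12 Transformer layers into three stages. It assumes at least three GPUs are visible; the hidden size, layer type, and the helper name `pipeline_forward` are illustrative only.

```python
import torch
import torch.nn as nn

# Minimal sketch: 12 Transformer layers split into 3 pipeline stages.
# Assumes at least three GPUs are visible; sizes are illustrative.
hidden = 1024
layers = [nn.TransformerEncoderLayer(d_model=hidden, nhead=16, batch_first=True)
          for _ in range(12)]

num_stages = 3
per_stage = len(layers) // num_stages  # 4 layers per GPU

stages = [
    nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(f"cuda:{i}")
    for i in range(num_stages)
]

def pipeline_forward(x: torch.Tensor) -> torch.Tensor:
    # Naive sequential execution: activations hop from device to device.
    # Without micro-batching, only one GPU is busy at a time.
    for i, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{i}"))
    return x
```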
A significant challenge with naive pipeline parallelism is the "pipeline bubble." While GPU 0 processes the first batch, GPUs 1 and 2 sit idle waiting for activations; likewise, once GPU 0 has finished its forward work, it sits idle while the later stages complete theirs. This idle time represents wasted compute resources.
To mitigate this, techniques like micro-batching are employed (popularized by GPipe and improved by systems like PipeDream). The main data batch is split into smaller micro-batches. As soon as GPU 0 finishes processing the first micro-batch, it sends the activations to GPU 1 and immediately starts processing the second micro-batch. This allows multiple micro-batches to be "in flight" simultaneously across the pipeline stages, significantly reducing idle time and improving hardware utilization.
Conceptual timeline showing micro-batches (MB) flowing through a 3-stage pipeline. Notice how GPU 0 starts MB2 while GPU 1 processes MB1, reducing idle time compared to processing the whole batch at once. Backward passes (Bwd) follow the forward passes (Fwd).
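As a rough illustration of the scheduling idea, the sketch below splits a batch into micro-batches before pushing them through toy stages. It runs serially in a single process for clarity; real GPipe- or PipeDream-style schedulers overlap micro-batches across devices asynchronously, and the stage modules and sizes here are placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-ins for pipeline stages (placeholders, not real Transformer blocks).
hidden = 1024
stages = [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(3)]

def pipelined_forward(x: torch.Tensor, num_microbatches: int = 4) -> torch.Tensor:
    # Split the global batch into micro-batches. In a real pipeline, stage i
    # starts micro-batch k+1 as soon as it hands micro-batch k to stage i+1;
    # here the loops are serial so the schedule is easy to follow.
    outputs = []
    for mb in torch.chunk(x, num_microbatches, dim=0):
        for stage in stages:
            mb = stage(mb)  # in practice: executed on that stage's GPU
        outputs.append(mb)
    return torch.cat(outputs, dim=0)

out = pipelined_forward(torch.randn(32, hidden))
```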
Pipeline parallelism is effective at reducing the memory footprint per GPU because each GPU only holds a fraction of the model's layers. However, it introduces communication latency between stages and requires careful load balancing to ensure stages have roughly equal computational cost.
While pipeline parallelism splits layers between devices, tensor parallelism splits the computation within a layer (specifically, its large weight matrices) across multiple devices. This is particularly relevant for Transformer models, where the multi-head attention and feed-forward network (FFN) layers contain large matrix multiplications.
Consider a large matrix multiplication $Y = XA$ within a Transformer layer. If matrix $A$ (representing model weights) is too large for a single GPU's memory, we can split it column-wise across two GPUs: $A = [A_1, A_2]$. The computation becomes:

$$Y = X[A_1, A_2] = [XA_1, XA_2]$$

GPU 0 computes $Y_1 = XA_1$ using its portion of the weights $A_1$, and GPU 1 computes $Y_2 = XA_2$ using $A_2$. The input $X$ is typically broadcast or made available to both GPUs. The results $Y_1$ and $Y_2$ can then be gathered if needed for subsequent operations.
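The column-wise split can be verified numerically in a few lines. The sketch below simulates the two GPUs with two tensors in a single process; in a real setup each shard would live on a different device and the concatenation would be an all-gather. The dimensions are arbitrary.

```python
import torch

d_in, d_out = 1024, 4096
X = torch.randn(8, d_in)        # input, available to both "GPUs"
A = torch.randn(d_in, d_out)

A1, A2 = A.chunk(2, dim=1)      # column-wise split: A = [A1, A2]
Y1 = X @ A1                     # computed on "GPU 0"
Y2 = X @ A2                     # computed on "GPU 1"

Y = torch.cat([Y1, Y2], dim=1)  # gather the partial outputs
assert torch.allclose(Y, X @ A, atol=1e-4)
```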
Alternatively, matrix $A$ can be split row-wise as $A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$. This requires splitting the input column-wise as $X = [X_1, X_2]$, so that each GPU produces a partial result that must be summed:

$$Y = [X_1, X_2]\begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = X_1 A_1 + X_2 A_2$$

This pattern is common in feed-forward layers, where the operation is $Y = \mathrm{GeLU}(XA)B$. Here, $A$ is split column-wise as $[A_1, A_2]$ and $B$ is split row-wise as $\begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$. GPU 0 computes $XA_1$ and GPU 1 computes $XA_2$. Because GeLU is applied elementwise, no gather is needed before the activation: GPU 0 computes $Y_1 = \mathrm{GeLU}(XA_1)B_1$ and GPU 1 computes $Y_2 = \mathrm{GeLU}(XA_2)B_2$. Finally, a single all-reduce sums the partial results: $Y = Y_1 + Y_2$.
Conceptual flow for tensor parallelism splitting a matrix multiplication XA across two GPUs. Communication (Gather/Reduce) is needed to combine partial results.
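The feed-forward sharding described above can be checked the same way. This single-process sketch splits $A$ column-wise and $B$ row-wise, computes the two partial outputs independently, and sums them where a real implementation would issue an all-reduce; sizes are illustrative.

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 1024, 4096
X = torch.randn(8, d_model)
A = torch.randn(d_model, d_ff)
B = torch.randn(d_ff, d_model)

A1, A2 = A.chunk(2, dim=1)   # column-wise split of A
B1, B2 = B.chunk(2, dim=0)   # row-wise split of B

# GeLU is elementwise, so each shard's activation can be computed locally.
Y1 = F.gelu(X @ A1) @ B1     # partial result on "GPU 0"
Y2 = F.gelu(X @ A2) @ B2     # partial result on "GPU 1"

Y = Y1 + Y2                  # stands in for the all-reduce across ranks
assert torch.allclose(Y, F.gelu(X @ A) @ B, rtol=1e-3, atol=1e-3)
```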
Tensor parallelism requires significant communication within each layer's computation, typically using collective operations such as all-gather or reduce-scatter. This demands high-bandwidth interconnects between the participating GPUs (e.g., NVLink). Frameworks like NVIDIA's Megatron-LM are specifically designed to implement tensor parallelism efficiently for Transformer models.
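When the shards actually live on separate GPUs, these combinations are expressed with torch.distributed collectives. A minimal sketch, assuming the process group has already been initialized with one process per GPU (for example via torchrun); the function names are illustrative.

```python
import torch
import torch.distributed as dist

def combine_row_parallel_output(partial_y: torch.Tensor) -> torch.Tensor:
    # Each tensor-parallel rank holds a partial sum Y_i; the all-reduce
    # leaves the full Y = sum_i Y_i on every rank.
    dist.all_reduce(partial_y, op=dist.ReduceOp.SUM)
    return partial_y

def combine_column_parallel_output(partial_y: torch.Tensor) -> torch.Tensor:
    # Each rank holds a slice of the output columns; the all-gather
    # concatenates the slices so every rank sees the full output.
    gathered = [torch.empty_like(partial_y) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, partial_y)
    return torch.cat(gathered, dim=-1)
```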
Another dimension sometimes discussed alongside tensor parallelism is Sequence Parallelism. When using tensor parallelism, activations often need to be gathered across devices, which can be memory-intensive for long sequences. Sequence parallelism provides strategies to split the activations along the sequence length dimension, distributing this memory burden across the tensor-parallel devices. This further enables scaling to longer context lengths.
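As a toy illustration of the idea (not a full implementation), the snippet below splits an activation tensor along the sequence dimension so that each tensor-parallel rank would only need to hold its own slice of the tokens; the shapes are made up.

```python
import torch

batch, seq_len, hidden = 2, 8192, 1024
activations = torch.randn(batch, seq_len, hidden)

tp_ranks = 4
shards = activations.chunk(tp_ranks, dim=1)  # split along the sequence dimension
# Each rank stores only seq_len / tp_ranks tokens' worth of activations.
print([tuple(s.shape) for s in shards])      # [(2, 2048, 1024), ...]
```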
In practice, training the largest models rarely relies on a single parallelism technique. Optimal performance is usually achieved by combining strategies: data parallelism across replicas, pipeline parallelism across stages, and tensor parallelism within each stage.
A hybrid parallelism strategy combining data parallelism across nodes, pipeline parallelism across stages within a node, and tensor parallelism (TP) within stages.
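A quick back-of-the-envelope sketch of how such a hybrid layout carves up a cluster; the GPU count and group sizes below are purely illustrative.

```python
# Illustrative numbers only: 64 GPUs arranged as TP x PP x DP.
world_size = 64
tensor_parallel = 4       # GPUs sharing NVLink within a node
pipeline_parallel = 4     # sequential stages of layers
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(f"TP={tensor_parallel}, PP={pipeline_parallel}, DP={data_parallel}")  # TP=4, PP=4, DP=4
```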
Implementing these model parallelism strategies manually is complex and error-prone. It requires careful handling of data movement, communication synchronization, and gradient computation across distributed devices.
Mastering model parallelism involves understanding these trade-offs and leveraging the right frameworks to distribute computation and memory effectively. It is a fundamental operational requirement for pushing the boundaries of model scale in LLM training.