When a single GPU is no longer sufficient for your training needs, either because the dataset is massive or the model itself is enormous, you must scale your training job across multiple processors. This is the domain of distributed training. By distributing the workload, you can drastically reduce the time it takes to train a model, moving from weeks to days or from days to hours. However, this is not as simple as just adding more hardware. Effective distributed training requires a strategy for dividing the work and coordinating the results.
The primary challenge in distributed training is communication. Every time GPUs need to synchronize, they spend time sending data to each other instead of performing calculations. Your goal is to maximize computation while minimizing this communication overhead. The two principal strategies for achieving this are data parallelism and model parallelism.
Data parallelism is the most common and often the most straightforward strategy for distributed training. The core idea is simple: you replicate your entire model on each available GPU, but you feed each GPU a different slice of the input data. This is an effective way to process a much larger batch of data in the same amount of time.
The process for a single training step using data parallelism typically follows these steps:

1. Replicate: an identical copy of the model is placed on every GPU.
2. Split: the global batch of data is divided, and each GPU receives a different slice.
3. Compute: each GPU runs the forward and backward pass on its slice, producing its own local gradients.
4. Aggregate: the gradients from all GPUs are combined (typically averaged) so that every replica sees the same gradient.
5. Update: each replica applies the identical update, keeping all model copies in sync.

In short, the model is replicated on each GPU, a batch of data is split across the GPUs, each GPU processes its portion, and the resulting gradients are aggregated before every replica is updated.
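To make the aggregation step concrete, here is a minimal sketch of a hand-rolled data-parallel step built from torch.distributed collectives. It assumes the process group has already been initialized (for example by launching the script with torchrun), and model, loss_fn, inputs, and targets are placeholders you would supply; in practice the DDP wrapper discussed later in this section handles this synchronization for you.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, lr=0.01):
    # 1-3. Forward and backward pass on this rank's slice of the batch.
    #      Every rank is assumed to hold an identical copy of `model`.
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()

    # 4. Average the gradients across all ranks with All-Reduce so that
    #    every replica ends up with the same gradient.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # 5. Apply the identical update on every rank, keeping replicas in sync.
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param -= lr * param.grad

    return loss.detach()
```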
The main benefit of data parallelism is its ability to accelerate training by increasing the total amount of data processed per second. However, its effectiveness is limited by the communication bandwidth between GPUs (e.g., NVLink or PCIe) and across machines (networking). It also does not solve the problem of a model being too large to fit in a single GPU's memory, since the entire model must be loaded onto each device.
What happens when your model is so large, with billions of parameters, that it cannot fit into the memory of even the largest available GPU? This is where model parallelism becomes necessary. Instead of replicating the model, you split the model itself across multiple GPUs.
Think of it like an assembly line. Each GPU is responsible for executing a specific subset of the model's layers.
In model parallelism, layers of the model are split across multiple GPUs. Data flows sequentially through the GPUs during the forward pass (activations), and gradients flow in reverse during the backward pass.
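The following is a minimal sketch of this naive layer splitting in PyTorch. It assumes a machine with at least two visible GPUs; the TwoStageModel class, layer sizes, and toy tensors are illustrative only.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    # A toy network whose layers are split across two GPUs: stage 0 lives
    # on cuda:0 and stage 1 on cuda:1. Activations are moved between
    # devices inside forward(), mirroring the assembly line described above.
    def __init__(self, in_dim=1024, hidden_dim=4096, out_dim=10):
        super().__init__()
        self.stage0 = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU()
        ).to("cuda:0")
        self.stage1 = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim)
        ).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))     # forward pass runs on GPU 0 first
        return self.stage1(x.to("cuda:1"))  # activations then hop to GPU 1

model = TwoStageModel()
outputs = model(torch.randn(32, 1024))
targets = torch.randint(0, 10, (32,), device="cuda:1")  # labels live with the output
loss = nn.functional.cross_entropy(outputs, targets)
loss.backward()  # gradients flow back across the device boundary automatically
```

Because autograd records the .to() transfers, the backward pass routes gradients from GPU 1 back to GPU 0 without any extra code. The cost is that only one GPU is busy at any given moment, which leads directly to the problem discussed next.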
The main drawback of this naive approach is GPU underutilization: while GPU 1 is working, GPU 0 sits idle, and vice versa. This idle time is often called the "pipeline bubble". More advanced techniques such as pipeline parallelism (e.g., GPipe) reduce it by splitting each data batch into smaller micro-batches and streaming them through the stages, so the GPUs can work on different micro-batches at the same time. Implementing model parallelism is significantly more complex than data parallelism and is generally reserved for situations where it is the only option.
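As a rough illustration of the micro-batching idea (not a faithful GPipe scheduler), the sketch below streams micro-batches through a stage-split model such as the one above; because CUDA kernels launch asynchronously, the first GPU can begin the next micro-batch while the second is still busy with the previous one. The function name and its arguments are placeholders.

```python
import torch

def pipelined_forward_backward(model, loss_fn, inputs, targets, n_micro=4):
    # Feed the batch through a stage-split model (such as TwoStageModel
    # above) as a stream of micro-batches. Because CUDA kernels are
    # launched asynchronously, GPU 0 can start on micro-batch i+1 while
    # GPU 1 is still processing micro-batch i, shrinking the bubble.
    micro_outputs = [model(mb) for mb in torch.chunk(inputs, n_micro)]

    # Recombine the micro-batch outputs so the loss and gradients match
    # what a single full-batch pass would have produced.
    outputs = torch.cat(micro_outputs)
    loss = loss_fn(outputs, targets.to(outputs.device))
    loss.backward()
    return loss.detach()
```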
For most scenarios, the choice is clear: if the model fits in a single GPU's memory, use data parallelism to process more data per second; turn to model parallelism only when the model itself is too large for one device.
In some extreme cases, such as training state-of-the-art Large Language Models (LLMs), these techniques are combined. Hybrid parallelism might use model parallelism to split a massive model across several GPUs within a single server node, and then use data parallelism to replicate that multi-GPU setup across many different server nodes.
Manually managing gradient synchronization and data transfer is complex and error-prone. Fortunately, major deep learning frameworks provide high-level abstractions to handle this for you.
PyTorch's primary tool for data parallelism is torch.nn.parallel.DistributedDataParallel (DDP). You wrap your model with DDP, and it aggregates gradients across processes with the efficient All-Reduce algorithm and keeps the model replicas in sync, while a DistributedSampler gives each process its own shard of the data.

TensorFlow's tf.distribute.Strategy API provides a flexible way to distribute training. For data parallelism on one or more machines, tf.distribute.MirroredStrategy and tf.distribute.MultiWorkerMirroredStrategy are the common choices. You define the strategy, build and compile your model within the strategy's scope, and TensorFlow handles the distribution logic.

These tools abstract away the low-level details, allowing you to convert a single-GPU training script to a distributed one with relatively few code changes. The next step is to put these strategies into practice, which we will explore in the hands-on labs.
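To make the DDP workflow concrete before the labs, here is a condensed sketch of what a converted PyTorch training script often looks like. It assumes the script is launched with torchrun, which sets the rank environment variables for each process; the linear model, random dataset, and hyperparameters are stand-ins for your own.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and dataset; substitute your own.
    model = torch.nn.Linear(128, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)   # gives each process its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)             # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()                  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> script.py
```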