While understanding the theory of tensor parallelism (TP) and pipeline parallelism (PP) is important, putting them into practice requires specialized tools. Megatron-LM is a prominent library from NVIDIA designed specifically to facilitate these complex parallelism strategies, particularly for training large Transformer models. It provides optimized implementations of parallel layers and manages the intricate communication patterns they require. This section focuses on how to configure Megatron-LM to leverage TP and PP.
Configuration in Megatron-LM is typically handled through command-line arguments passed to the training script (such as pretrain_gpt.py or a similar entry point). Let's look at the specifics of setting up tensor and pipeline parallelism.
Tensor parallelism, sometimes called intra-layer model parallelism, involves splitting the execution of individual large layers (such as the weight matrices in MLP blocks or attention mechanisms) across multiple GPUs. Megatron-LM provides specialized layer implementations (e.g., ColumnParallelLinear and RowParallelLinear) that handle this partitioning and the necessary communication (such as AllReduce or AllGather) automatically.
The primary argument to enable and control TP is --tensor-model-parallel-size (or a similar variant, depending on the specific Megatron-LM version or fork). This value, let's call it TP_size, specifies the number of GPUs across which each tensor-parallel layer will be split.

For example, if you set --tensor-model-parallel-size 4, a large linear layer's weight matrix would be partitioned into 4 column or row segments, with each segment residing on a different GPU within the tensor-parallel group. During the forward and backward passes, Megatron-LM orchestrates the required data exchanges between these 4 GPUs.
# Illustration within a Megatron-LM context (simplified pseudocode).
# Actual usage goes through Megatron's model-definition utilities, and the
# real constructors take additional arguments (init method, config, etc.).
# Assume tp_size = 2:
# GPU 0 holds W_A, GPU 1 holds W_B, where W = [W_A W_B]

# Communicator group containing the GPUs in this tensor-parallel group
tp_group = get_tensor_model_parallel_group()

# ColumnParallelLinear: splits the weight matrix by columns.
# The input is broadcast to all TP ranks; each rank computes a slice of
# the output, which can then be gathered.
# Input X -> [X W_A, X W_B] -> AllGather -> Output Y
linear_column_parallel = ColumnParallelLinear(
    input_size,
    output_size,
    tp_group=tp_group,  # signature simplified for illustration
)

# RowParallelLinear: splits the weight matrix by rows.
# The input is already split across TP ranks; each rank computes a partial
# result, and the partial results are summed.
# Input X -> [X_A, X_B] -> [X_A W_A, X_B W_B] -> AllReduce -> Output Y
linear_row_parallel = RowParallelLinear(
    input_size,
    output_size,
    tp_group=tp_group,  # signature simplified for illustration
)

# Example usage within a transformer MLP block (simplified): the
# column-parallel output stays sharded, the activation is applied locally,
# and the row-parallel layer performs the final AllReduce.
# mlp_output = linear_row_parallel(activation(
#     linear_column_parallel(layer_input)))
Important Considerations for TP: TP_size must evenly divide the model's hidden size (--hidden-size) and the number of attention heads (--num-attention-heads). Megatron-LM's parallel layers rely on this divisibility for partitioning; a minimal sanity check is sketched below.
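The divisibility requirement is easy to verify before launching a job. The following is a minimal sketch in plain Python, using hypothetical values that match the example command later in this section; it is not part of Megatron-LM itself.

# Minimal sanity check for the TP divisibility constraints
# (plain Python, hypothetical values; not part of Megatron-LM).
hidden_size = 2048
num_attention_heads = 32
tp_size = 2

assert hidden_size % tp_size == 0, \
    "hidden size must split evenly across tensor-parallel ranks"
assert num_attention_heads % tp_size == 0, \
    "attention heads must split evenly across tensor-parallel ranks"

# Per-rank shard sizes after partitioning
print("columns per rank:", hidden_size // tp_size)
print("heads per rank:  ", num_attention_heads // tp_size)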
Pipeline parallelism involves partitioning the model's layers sequentially across different GPUs, forming a pipeline. Each GPU (or set of GPUs, if combined with TP/DP) forms a "stage" in the pipeline, responsible for executing only a subset of the model's layers.

The main argument for configuring PP is --pipeline-model-parallel-size (or similar). This value, PP_size, determines the number of stages in the pipeline. For instance, --pipeline-model-parallel-size 4 creates a 4-stage pipeline.
To keep the pipeline stages utilized and minimize idle time ("bubbles"), Megatron-LM employs microbatching. The overall training batch is split into smaller microbatches that flow through the pipeline stages concurrently. The number of microbatches is a critical tuning parameter; depending on the version, it is either set directly (e.g., via a --num-microbatches argument) or derived from the global batch size, microbatch size, and data-parallel degree. A common scheduling approach is 1F1B (one forward pass, one backward pass per microbatch), which helps balance computation and memory usage.
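To build intuition for why the number of microbatches matters, the sketch below computes the derived microbatch count and a rough estimate of the time lost to pipeline bubbles, using the common approximation bubble ≈ (PP_size - 1) / (num_microbatches + PP_size - 1). The values are hypothetical and match the example command later in this section.

# Rough back-of-the-envelope for pipeline utilization (plain Python).
# Values are hypothetical; the bubble estimate assumes uniform stage times.
global_batch_size = 128
micro_batch_size = 4
dp_size = 1
pp_size = 4

num_microbatches = global_batch_size // (micro_batch_size * dp_size)  # 32

# With GPipe-style or 1F1B scheduling, the idle "bubble" fraction is roughly
# (pp_size - 1) / (num_microbatches + pp_size - 1).
bubble_fraction = (pp_size - 1) / (num_microbatches + pp_size - 1)

print("num_microbatches:", num_microbatches)              # 32
print(f"approx. bubble fraction: {bubble_fraction:.1%}")  # ~8.6%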
Flow of microbatches (MB1, MB2, MB3) through the forward passes (Fwd) of a 3-stage pipeline (PP_size = 3). Activations (Actvs) are passed between stages. The backward pass follows a similar, reversed flow.
Important Considerations for PP: The total number of layers (--num-layers) should ideally be divisible by PP_size for balanced load distribution, although Megatron-LM can handle uneven distributions.

Megatron-LM excels at combining different parallelism strategies. Often, TP and PP are used together. In such a setup, the total number of GPUs involved in model parallelism is TP_size × PP_size. Each pipeline stage might itself consist of multiple GPUs working together using tensor parallelism. Data parallelism (DP) can then be layered on top, replicating this TP/PP structure.
The total number of GPUs used for training would be N_gpus = DP_size × TP_size × PP_size.
Example configuration with 8 GPUs: PP_size = 4 and TP_size = 2. Each pipeline stage uses 2 GPUs for tensor parallelism. Communication occurs point-to-point between stages (PP) and via collective operations within each TP group (TP).
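To make the grouping concrete, the sketch below enumerates the 8 GPU ranks of this configuration into a (pipeline stage, tensor-parallel rank) grid. The rank-to-group assignment shown is purely illustrative; Megatron-LM constructs its actual process groups internally, and the exact ordering depends on the version.

# Illustrative mapping of 8 global ranks onto a PP=4 x TP=2 grid (DP=1).
# This is only to visualize the grouping; Megatron-LM builds its real
# process groups internally.
tp_size = 2
pp_size = 4

for rank in range(tp_size * pp_size):
    pp_stage = rank // tp_size   # which pipeline stage this GPU belongs to
    tp_rank = rank % tp_size     # position within the stage's TP group
    print(f"global rank {rank}: pipeline stage {pp_stage}, tp rank {tp_rank}")

# Stage 0: ranks 0,1   Stage 1: ranks 2,3
# Stage 2: ranks 4,5   Stage 3: ranks 6,7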
Here's a simplified example of how the command-line arguments might look when configuring Megatron-LM for a hypothetical GPT-style model using both TP and PP (assuming a total of 8 GPUs, with TP_size = 2, PP_size = 4, and data parallelism DP_size = 1 for simplicity):
# Example training command using Megatron-LM arguments
# (actual script name, launcher, and exact argument names may vary by version).
#
# With DP_size = 1:
#   num_microbatches = global_batch_size / (micro_batch_size * dp_size)
#                    = 128 / (4 * 1) = 32
# Many Megatron-LM versions derive this value automatically from
# --global-batch-size and --micro-batch-size; others accept it explicitly.
python pretrain_gpt.py \
    --num-layers 24 \
    --hidden-size 2048 \
    --num-attention-heads 32 \
    --seq-length 2048 \
    \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 4 \
    \
    --micro-batch-size 4 \
    --global-batch-size 128 \
    --num-microbatches 32 \
    \
    --optimizer adam \
    --learning-rate 1.0e-4 \
    --weight-decay 0.01 \
    --clip-grad 1.0 \
    \
    --train-data-path <path_to_train_data> \
    --valid-data-path <path_to_valid_data> \
    --tokenizer-type SentencePieceTokenizer \
    --tokenizer-model <path_to_tokenizer> \
    \
    --distributed-backend nccl \
    --save <path_to_save_checkpoints> \
    --load <path_to_load_checkpoints> \
    --log-interval 10 \
    --save-interval 1000 \
    --eval-interval 100 \
    --num-workers 2
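Note that the script must ultimately be started as one process per GPU. A common way to do this on a single 8-GPU node (launcher details vary by setup and Megatron-LM version) is a torch.distributed launcher such as torchrun:

# Launch one process per GPU on a single 8-GPU node (hypothetical example;
# append the model and parallelism arguments shown above).
torchrun --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 4 \
    <remaining arguments as above>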
In this example:
- --tensor-model-parallel-size 2: Splits layers like Linear and Attention across 2 GPUs.
- --pipeline-model-parallel-size 4: Splits the 24 layers across 4 sequential stages (roughly 6 layers per stage).
- --num-microbatches 32: Feeds the 4-stage pipeline efficiently; the value follows from the global batch size, microbatch size, and data-parallel degree.

Configuring these parameters correctly is essential for balancing computation, memory usage, and communication overhead across the available hardware. Megatron-LM provides the underlying mechanisms, but determining the optimal TP_size, PP_size, and number of microbatches often requires experimentation based on the specific model architecture, hardware setup (GPU type, interconnects), and desired batch size.
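When planning such experiments, it can help to start by enumerating the valid ways to factor the available GPUs into DP, TP, and PP degrees. The sketch below is a simple, hypothetical helper (plain Python, not part of Megatron-LM) that lists all factorizations satisfying N_gpus = DP_size × TP_size × PP_size for 8 GPUs.

# List all (dp, tp, pp) factorizations of the available GPU count.
# Plain Python helper for planning experiments; not part of Megatron-LM.
n_gpus = 8

candidates = [
    (dp, tp, pp)
    for dp in range(1, n_gpus + 1)
    for tp in range(1, n_gpus + 1)
    for pp in range(1, n_gpus + 1)
    if dp * tp * pp == n_gpus
]

for dp, tp, pp in candidates:
    print(f"DP={dp}  TP={tp}  PP={pp}")
# Which candidate works best depends on model size, memory per GPU,
# and interconnect bandwidth; it is usually found empirically.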