Applying parallelism strategies and optimization principles in practice means translating them into a concrete configuration for a training framework. This walkthrough demonstrates how to set up a distributed training job for a large-scale Mixture of Experts model, focusing on the essential configuration parameters to tune, using a training script that accepts a configuration file.
Our scenario involves training an MoE model on a cluster of two nodes, each equipped with eight GPUs, for a total of 16 GPUs. We will use a hybrid parallelism strategy to efficiently scale the training process.
For large MoE models, a single parallelism technique is rarely sufficient. We must combine them to address different system bottlenecks. A common and effective strategy combines data parallelism (DP) and expert parallelism (EP).
For our 16-GPU setup, a sensible configuration is an expert-parallel size of 8 and a data-parallel size of 2.
- Expert Parallel Size (EP) = 8
- Data Parallel Size (DP) = 2
- Total GPUs = EP * DP = 8 * 2 = 16

This means we will have two data-parallel replicas of the model. Within each replica, the experts are sharded across 8 GPUs. This setup is effective because it allows all-to-all communication for expert processing to occur within a single node (which typically has high-speed interconnects like NVLink), while the slower inter-node communication is only used for synchronizing gradients between the two data-parallel replicas.
Diagram of a hybrid parallelism setup with 2 data parallel replicas and an expert parallel size of 8. Experts are sharded within each node, and gradients are synchronized across nodes.
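To make the layout concrete, the sketch below shows one way this 2 x 8 arrangement could be mapped to communication groups with torch.distributed. The function and variable names are illustrative and not tied to any particular framework; it assumes the default process group has already been initialized across all 16 ranks.

# moe_process_groups.py (illustrative sketch; assumes dist.init_process_group has run)
import torch.distributed as dist

def build_moe_groups(world_size: int = 16, ep_size: int = 8):
    """Create expert-parallel and data-parallel process groups.

    Ranks 0-7 sit on node 0 and ranks 8-15 on node 1, so each expert-parallel
    group stays inside one node (fast all-to-all over NVLink), while each
    data-parallel group spans the two nodes (gradient all-reduce).
    """
    dp_size = world_size // ep_size  # 2 data-parallel replicas

    # Expert-parallel groups: ranks [0..7] and [8..15].
    ep_groups = [
        dist.new_group(list(range(r * ep_size, (r + 1) * ep_size)))
        for r in range(dp_size)
    ]

    # Data-parallel groups: ranks with the same local index, e.g. (0, 8), (1, 9), ...
    dp_groups = [
        dist.new_group([i + r * ep_size for r in range(dp_size)])
        for i in range(ep_size)
    ]

    rank = dist.get_rank()
    return ep_groups[rank // ep_size], dp_groups[rank % ep_size]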
With the parallelism strategy defined, we now turn to the MoE-specific hyperparameters. These settings directly control the trade-offs between computational efficiency, memory usage, and model performance.
The capacity factor is one of the most important MoE hyperparameters. It determines the size of the buffer allocated for each expert to process tokens. The ideal, perfectly balanced number of tokens per expert is (batch_size * sequence_length) / num_experts. The capacity factor is a multiplier on this ideal value.
A capacity_factor of 1.0 means the buffer is exactly the size of the ideal distribution. However, router assignments are never perfect, so some experts will be assigned more tokens than others. A factor greater than 1.0 provides a buffer to handle this imbalance.
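As a quick illustration of that formula, the helper below computes the per-expert buffer size. The names are illustrative, and actual frameworks may round or scale the result slightly differently.

# expert_capacity.py (illustrative sketch of the capacity formula)
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Number of token slots allocated to each expert."""
    ideal = tokens_per_batch / num_experts        # perfectly balanced load
    return math.ceil(capacity_factor * ideal)     # extra room for imbalance

# With the settings used later in this walkthrough:
# tokens_per_batch = batch_size_per_gpu * sequence_length = 4 * 2048 = 8192
print(expert_capacity(8192, 64, 1.25))  # 160 slots per expert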
Choosing the right value requires experimentation. A common practice is to start with a value like 1.25 or 1.5 and monitor the percentage of dropped tokens during early training phases. If the dropped token rate is consistently above 1-2%, you should consider increasing the capacity factor.
The relationship between capacity factor, the percentage of dropped tokens, and the required GPU memory. Increasing the factor reduces token loss at the cost of higher memory consumption.
As discussed earlier in the chapter, auxiliary losses are essential for stable MoE training. Their weights are hyperparameters that you must configure.
- load_balance_loss_coeff: This coefficient scales the auxiliary loss that encourages the gating network to distribute tokens evenly across all experts. A typical starting value is in the range of 0.01. If you observe that a few experts are consistently over-utilized while others are idle, you may need to increase this value.
- router_z_loss_coeff: This coefficient scales the loss term that penalizes large logit values from the gating network. It acts as a regularizer to improve numerical stability. A small value like 0.001 is often a good starting point (a sketch of how these two loss terms are typically computed appears after the configuration file below).

Let's bring these elements together into a single configuration file. Here is an example of what a Python dictionary or JSON file used by a training script might look like for our 16-GPU job.
# moe_training_config.py
config = {
    # Model architecture
    "model": {
        "num_layers": 32,
        "hidden_size": 4096,
        "num_attention_heads": 32,
        "moe_layer_frequency": 2,  # Use an MoE layer every 2 Transformer blocks
        "moe": {
            "num_experts": 64,
            "num_experts_per_tok": 2,  # Top-2 gating
        },
    },
    # Training settings
    "training": {
        "batch_size_per_gpu": 4,
        "sequence_length": 2048,
        "optimizer": "AdamW",
        "learning_rate": 1e-4,
        "precision": "bfloat16",  # Use BFloat16 for memory and speed
    },
    # Distributed training configuration
    "distributed": {
        "data_parallel_size": 2,
        "expert_parallel_size": 8,
        "tensor_parallel_size": 1,  # No tensor parallelism in this example
    },
    # MoE-specific optimization parameters
    "moe_optimization": {
        # Set capacity factor to 25% above the ideal load
        "capacity_factor": 1.25,
        # Coefficient for the load balancing auxiliary loss
        "load_balance_loss_coeff": 0.01,
        # Coefficient for the router logit regularization loss
        "router_z_loss_coeff": 0.001,
    },
}
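The two coefficients under moe_optimization scale auxiliary loss terms computed from the router's outputs during the forward pass. Below is a minimal sketch of those terms, assuming a Switch-Transformer-style formulation with top-1 counting for the load statistics; the function and tensor names are illustrative rather than a specific framework's API.

# moe_aux_losses.py (illustrative sketch, not a specific framework's implementation)
import torch
import torch.nn.functional as F

def moe_auxiliary_losses(router_logits, expert_indices, num_experts,
                         load_balance_loss_coeff=0.01,
                         router_z_loss_coeff=0.001):
    """router_logits: (num_tokens, num_experts); expert_indices: (num_tokens,)
    holding the top-1 expert id chosen for each token."""
    probs = F.softmax(router_logits, dim=-1)

    # Load balancing loss: pushes the fraction of tokens routed to each expert (f)
    # and the mean router probability per expert (p) toward a uniform distribution.
    f = F.one_hot(expert_indices, num_experts).float().mean(dim=0)
    p = probs.mean(dim=0)
    load_balance_loss = num_experts * torch.sum(f * p)

    # Router z-loss: penalizes large router logits to keep the softmax
    # numerically stable.
    z = torch.logsumexp(router_logits, dim=-1)
    router_z_loss = torch.mean(z ** 2)

    return (load_balance_loss_coeff * load_balance_loss
            + router_z_loss_coeff * router_z_loss)

During training, this combined term would typically be added to the main language-modeling loss before backpropagation.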
Once the configuration is defined, you would launch the distributed job using a tool like torchrun or a framework-specific launcher like deepspeed.
The work does not end at launch. Monitoring the training run is just as important as the initial setup. Pay close attention to these metrics, which are often logged by advanced training frameworks:
- The percentage of dropped tokens, which indicates whether your capacity_factor is set appropriately. Log this value every training step.
- The load balancing auxiliary loss (scaled by load_balance_loss_coeff) and the distribution of tokens across experts.
- The router z-loss (scaled by router_z_loss_coeff), as a check on the gating network's numerical stability.

A comparison of token distribution across experts. Poor balancing shows a few experts dominating, while good balancing, achieved with a higher load balancing loss coefficient, results in a more even distribution.
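A minimal sketch of how such routing statistics could be computed from the router's per-batch assignments is shown below. The function and metric names are illustrative; mature frameworks usually expose equivalents of these numbers directly.

# moe_routing_stats.py (illustrative sketch of per-step MoE monitoring)
import torch

def moe_routing_stats(expert_indices, num_experts, capacity):
    """expert_indices: (num_tokens,) top-1 expert id for each token;
    capacity: token slots per expert, derived from the capacity factor."""
    tokens_per_expert = torch.bincount(expert_indices, minlength=num_experts)

    # Tokens assigned beyond an expert's capacity are dropped (or passed through).
    dropped = torch.clamp(tokens_per_expert - capacity, min=0).sum()

    return {
        "dropped_token_fraction": (dropped / expert_indices.numel()).item(),
        "max_expert_load": tokens_per_expert.max().item(),
        "min_expert_load": tokens_per_expert.min().item(),
    }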
By methodically defining your configuration and carefully monitoring these metrics, you can successfully navigate the training of these powerful sparse models.