Applying parallelism strategies and optimization principles in practice means translating them into a concrete configuration for a training framework. This walkthrough demonstrates how to set up a distributed training job for a large-scale Mixture of Experts model, focusing on the essential configuration parameters to tune, using a training script that accepts a configuration file.
Our scenario involves training an MoE model on a cluster of two nodes, each equipped with eight GPUs, for a total of 16 GPUs. We will use a hybrid parallelism strategy to efficiently scale the training process.
For large MoE models, a single parallelism technique is rarely sufficient. We must combine them to address different system bottlenecks. A common and effective strategy combines data parallelism (DP) and expert parallelism (EP).
For our 16-GPU setup, a sensible configuration is an expert-parallel size of 8 and a data-parallel size of 2.
- Expert Parallel Size (EP) = 8
- Data Parallel Size (DP) = 2
- Total GPUs = EP * DP = 8 * 2 = 16

This means we will have two data-parallel replicas of the model. Within each replica, the experts are sharded across 8 GPUs. This setup is effective because it allows all-to-all communication for expert processing to occur within a single node (which typically has high-speed interconnects like NVLink), while the slower inter-node communication is only used for synchronizing gradients between the two data-parallel replicas.
Diagram of a hybrid parallelism setup with 2 data parallel replicas and an expert parallel size of 8. Experts are sharded within each node, and gradients are synchronized across nodes.
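To make the layout concrete, the sketch below shows one way this 2 x 8 arrangement could be mapped to communication groups with torch.distributed. The function and variable names are illustrative and not tied to any particular framework; it assumes the default process group has already been initialized across all 16 ranks.

# moe_process_groups.py (illustrative sketch; assumes dist.init_process_group has run)
import torch.distributed as dist

def build_moe_groups(world_size: int = 16, ep_size: int = 8):
    """Create expert-parallel and data-parallel process groups.

    Ranks 0-7 sit on node 0 and ranks 8-15 on node 1, so each expert-parallel
    group stays inside one node (fast all-to-all over NVLink), while each
    data-parallel group spans the two nodes (gradient all-reduce).
    """
    dp_size = world_size // ep_size  # 2 data-parallel replicas

    # Expert-parallel groups: ranks [0..7] and [8..15].
    ep_groups = [
        dist.new_group(list(range(r * ep_size, (r + 1) * ep_size)))
        for r in range(dp_size)
    ]

    # Data-parallel groups: ranks with the same local index, e.g. (0, 8), (1, 9), ...
    dp_groups = [
        dist.new_group([i + r * ep_size for r in range(dp_size)])
        for i in range(ep_size)
    ]

    rank = dist.get_rank()
    return ep_groups[rank // ep_size], dp_groups[rank % ep_size]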
With the parallelism strategy defined, we now turn to the MoE-specific hyperparameters. These settings directly control the trade-offs between computational efficiency, memory usage, and model performance.
The capacity factor is one of the most important MoE hyperparameters. It determines the size of the buffer allocated for each expert to process tokens. The ideal, perfectly balanced number of tokens per expert is (batch_size * sequence_length) / num_experts. The capacity factor is a multiplier on this ideal value.
A capacity_factor of 1.0 means the buffer is exactly the size of the ideal distribution. However, router assignments are never perfect, so some experts will be assigned more tokens than others. A factor greater than 1.0 provides a buffer to handle this imbalance.
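As a quick illustration of that formula, the helper below computes the per-expert buffer size. The names are illustrative, and actual frameworks may round or scale the result slightly differently.

# expert_capacity.py (illustrative sketch of the capacity formula)
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Number of token slots allocated to each expert."""
    ideal = tokens_per_batch / num_experts        # perfectly balanced load
    return math.ceil(capacity_factor * ideal)     # extra room for imbalance

# With the settings used later in this walkthrough:
# tokens_per_batch = batch_size_per_gpu * sequence_length = 4 * 2048 = 8192
print(expert_capacity(8192, 64, 1.25))  # 160 slots per expert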
Choosing the right value requires experimentation. A common practice is to start with a value like 1.25 or 1.5 and monitor the percentage of dropped tokens during early training phases. If the dropped token rate is consistently above 1-2%, you should consider increasing the capacity factor.
The relationship between capacity factor, the percentage of dropped tokens, and the required GPU memory. Increasing the factor reduces token loss at the cost of higher memory consumption.
As discussed earlier in the chapter, auxiliary losses are essential for stable MoE training. Their weights are hyperparameters that you must configure.
- load_balance_loss_coeff: This coefficient scales the auxiliary loss that encourages the gating network to distribute tokens evenly across all experts. A typical starting value is in the range of 0.01. If you observe that a few experts are consistently over-utilized while others are idle, you may need to increase this value.
- router_z_loss_coeff: This coefficient scales the loss term that penalizes large logit values from the gating network. It acts as a regularizer to improve numerical stability. A small value like 0.001 is often a good starting point (a sketch of how these two loss terms are typically computed appears after the configuration file below).

Let's bring these elements together into a single configuration file. Here is an example of what a Python dictionary or JSON file used by a training script might look like for our 16-GPU job.
# moe_training_config.py
config = {
    # Model architecture
    "model": {
        "num_layers": 32,
        "hidden_size": 4096,
        "num_attention_heads": 32,
        "moe_layer_frequency": 2,  # Use an MoE layer every 2 Transformer blocks
        "moe": {
            "num_experts": 64,
            "num_experts_per_tok": 2,  # Top-2 gating
        },
    },
    # Training settings
    "training": {
        "batch_size_per_gpu": 4,
        "sequence_length": 2048,
        "optimizer": "AdamW",
        "learning_rate": 1e-4,
        "precision": "bfloat16",  # Use BFloat16 for memory and speed
    },
    # Distributed training configuration
    "distributed": {
        "data_parallel_size": 2,
        "expert_parallel_size": 8,
        "tensor_parallel_size": 1,  # No tensor parallelism in this example
    },
    # MoE-specific optimization parameters
    "moe_optimization": {
        # Set capacity factor to 25% above the ideal load
        "capacity_factor": 1.25,
        # Coefficient for the load balancing auxiliary loss
        "load_balance_loss_coeff": 0.01,
        # Coefficient for the router logit regularization loss
        "router_z_loss_coeff": 0.001,
    },
}
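The two coefficients under moe_optimization scale auxiliary loss terms computed from the router's outputs during the forward pass. Below is a minimal sketch of those terms, assuming a Switch-Transformer-style formulation with top-1 counting for the load statistics; the function and tensor names are illustrative rather than a specific framework's API.

# moe_aux_losses.py (illustrative sketch, not a specific framework's implementation)
import torch
import torch.nn.functional as F

def moe_auxiliary_losses(router_logits, expert_indices, num_experts,
                         load_balance_loss_coeff=0.01,
                         router_z_loss_coeff=0.001):
    """router_logits: (num_tokens, num_experts); expert_indices: (num_tokens,)
    holding the top-1 expert id chosen for each token."""
    probs = F.softmax(router_logits, dim=-1)

    # Load balancing loss: pushes the fraction of tokens routed to each expert (f)
    # and the mean router probability per expert (p) toward a uniform distribution.
    f = F.one_hot(expert_indices, num_experts).float().mean(dim=0)
    p = probs.mean(dim=0)
    load_balance_loss = num_experts * torch.sum(f * p)

    # Router z-loss: penalizes large router logits to keep the softmax
    # numerically stable.
    z = torch.logsumexp(router_logits, dim=-1)
    router_z_loss = torch.mean(z ** 2)

    return (load_balance_loss_coeff * load_balance_loss
            + router_z_loss_coeff * router_z_loss)

During training, this combined term would typically be added to the main language-modeling loss before backpropagation.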
Once the configuration is defined, you would launch the distributed job using a tool like torchrun or a framework-specific launcher like deepspeed.
The work does not end at launch. Monitoring the training run is just as important as the initial setup. Pay close attention to these metrics, which are often logged by advanced training frameworks:
- The percentage of dropped tokens, which indicates whether your capacity_factor is set appropriately. Log this value every training step.
- The load balancing auxiliary loss (scaled by load_balance_loss_coeff) and the distribution of tokens across experts.
- The router z-loss (scaled by router_z_loss_coeff), as a check on the gating network's numerical stability.

A comparison of token distribution across experts. Poor balancing shows a few experts dominating, while good balancing, achieved with a higher load balancing loss coefficient, results in a more even distribution.
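A minimal sketch of how such routing statistics could be computed from the router's per-batch assignments is shown below. The function and metric names are illustrative; mature frameworks usually expose equivalents of these numbers directly.

# moe_routing_stats.py (illustrative sketch of per-step MoE monitoring)
import torch

def moe_routing_stats(expert_indices, num_experts, capacity):
    """expert_indices: (num_tokens,) top-1 expert id for each token;
    capacity: token slots per expert, derived from the capacity factor."""
    tokens_per_expert = torch.bincount(expert_indices, minlength=num_experts)

    # Tokens assigned beyond an expert's capacity are dropped (or passed through).
    dropped = torch.clamp(tokens_per_expert - capacity, min=0).sum()

    return {
        "dropped_token_fraction": (dropped / expert_indices.numel()).item(),
        "max_expert_load": tokens_per_expert.max().item(),
        "min_expert_load": tokens_per_expert.min().item(),
    }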
By methodically defining your configuration and carefully monitoring these metrics, you can successfully navigate the training of these powerful sparse models.