While auxiliary losses and careful router design are important for stable Mixture of Experts (MoE) training, the choice of optimizer and its associated hyperparameters (learning rate, weight decay, momentum parameters) also plays a substantial role. These settings directly influence how gradients, including those from the load balancing loss, update the model parameters, impacting both expert specialization and router stability. Failing to tune them appropriately can undermine the benefits of sophisticated architectures or auxiliary losses.

## Standard Optimizers in MoE Training

For large transformer-based models, including MoEs, the AdamW optimizer remains a common and often effective choice. AdamW combines adaptive learning rates (like Adam) with decoupled weight decay (in contrast to Adam's L2-regularization implementation). This combination is generally well suited to the complexities of large models.

The adaptive nature of Adam/AdamW, which maintains per-parameter learning rates based on estimates of the first and second moments of the gradients, can be beneficial for MoE layers. Experts that are activated less frequently might receive smaller gradient updates on average; adaptive methods can help compensate for this, ensuring that even infrequently used experts continue to learn. However, this same adaptivity can sometimes interact unexpectedly with the auxiliary load balancing loss and the router's gradients.

## Learning Rate Sensitivity

The learning rate is arguably the most critical hyperparameter for MoE training stability, particularly for the gating network.

- **Router Sensitivity:** The gating network's parameters determine the routing decisions. If the learning rate is too high, the router's outputs can change drastically between updates, leading to unstable routing assignments, oscillating load balance metrics, and potentially training divergence. The gradients flowing to the router depend on both the main task loss and the auxiliary load balancing loss $L_{aux}$.
A large learning rate combined with a significant auxiliary loss coefficient $\alpha$ can cause overly aggressive updates to the router weights.
- **Expert Learning:** While experts are typically standard feed-forward networks, their learning is influenced by the conditional computation. A learning rate appropriate for dense training might need adjustment based on how frequently experts are activated and on the overall training stability.
- **Learning Rate Schedules:** Employing learning rate schedules, such as linear warmup followed by cosine or linear decay, is standard practice and highly recommended for MoEs. Warmup allows the model, especially the router, to stabilize early in training before larger learning rates take effect. The subsequent decay helps fine-tune the model towards convergence. The duration of the warmup phase might need careful tuning; a longer warmup can sometimes help stabilize routing in challenging setups.

Consider experimenting with differential learning rates, potentially using a smaller learning rate for the gating network than for the experts, although this adds complexity to the training configuration.

## Weight Decay

Weight decay acts as a regularizer, penalizing large parameter values to prevent overfitting. In MoEs:

- It applies to both the expert parameters and the gating network parameters.
- For experts, it functions similarly to standard network regularization.
- For the router, weight decay can sometimes help prevent the router from becoming overly confident in specific routing decisions early on, potentially promoting exploration.
- Its interaction with the load balancing loss $L_{aux}$ requires consideration. If $L_{aux}$ strongly encourages diversification, aggressive weight decay on the router might be less necessary, or even counterproductive if it overly dampens learned routing signals.
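The decoupling of weight decay from the adaptive gradient step can be made concrete with a minimal single-parameter sketch of an AdamW-style update. This is an illustration, not a production implementation; the function name and default values are chosen for readability.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-4, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW-style update for a single scalar parameter p with gradient g.

    m, v are the running first/second moment estimates; t is the 1-based
    step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g                 # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g             # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)   # adaptive gradient step
    p = p - lr * wd * p                             # decoupled weight decay
    return p, m, v
```

Note that the decay term `lr * wd * p` acts on the parameter directly, independent of the gradient moments; in Adam's L2-regularization formulation it would instead be folded into `g` and rescaled by the adaptive denominator.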
As with other hyperparameters, its optimal value is typically found empirically.

## Optimizer State: Betas in AdamW

AdamW uses exponential moving averages to estimate the first moment (mean, controlled by $\beta_1$) and second moment (uncentered variance, controlled by $\beta_2$) of the gradients.

- $\beta_1$ controls the momentum effect. Values closer to 1 mean past gradients have a longer-lasting influence.
- $\beta_2$ controls the adaptive learning rate scaling. Values closer to 1 mean past squared gradients have a longer-lasting influence, leading to slower changes in the per-parameter learning rates.

While the default values (e.g., $\beta_1=0.9$, $\beta_2=0.999$, or $\beta_2=0.95$ in some large model training) often work well, highly unstable MoE training runs might occasionally benefit from tuning these. For instance, lowering $\beta_2$ slightly makes the adaptive learning rates more reactive to recent gradient information, which can be helpful or harmful depending on the specific instability pattern. However, adjusting the betas is usually a secondary tuning step, after optimizing the learning rate and auxiliary loss coefficient.

## Interaction with Load Balancing and Gradient Clipping

The optimizer sees the gradient of the total loss:

$$ \nabla L_{total} = \nabla L_{task} + \alpha \nabla L_{aux} $$

The scale of $\nabla L_{aux}$ relative to $\nabla L_{task}$, modulated by $\alpha$, directly impacts the updates computed by AdamW. If $\alpha$ is too large, the load balancing gradient can dominate, potentially disrupting the learning dynamics of the primary task. This is why $\alpha$ and the learning rate should be tuned jointly.

Gradient clipping, which caps the norm of the gradients before the optimizer step, is another important technique for MoE stability.
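A pure-Python sketch of how these two mechanisms compose: the total gradient is formed from the task and auxiliary gradients as in the equation above, then rescaled whenever its global norm exceeds the clip threshold. The function name and defaults are illustrative.

```python
import math

def clipped_total_grad(grad_task, grad_aux, alpha=0.01, max_norm=1.0):
    """Combine task and auxiliary gradients, then clip by global norm."""
    # grad of L_total = grad of L_task + alpha * grad of L_aux
    total = [gt + alpha * ga for gt, ga in zip(grad_task, grad_aux)]
    norm = math.sqrt(sum(g * g for g in total))
    if norm > max_norm:
        scale = max_norm / norm           # rescale so the norm equals max_norm
        total = [g * scale for g in total]
    return total
```

A well-behaved gradient passes through unchanged; an occasional spike, for example from one unstable batch, is rescaled to `max_norm` before it can contaminate the optimizer's moment estimates.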
It prevents occasional large gradients, perhaps arising from unstable routing or specific difficult batches, from destabilizing the optimizer state (especially the moment estimates in AdamW) and causing large parameter swings. A typical clip value might be 1.0, but this too can be tuned.

*Figure: Evolution of a load balancing metric (e.g., coefficient of variation of expert loads) during initial MoE training under different learning rates: LR = 1e-4 remains stable, LR = 5e-4 oscillates, and LR = 1e-3 is unstable. Higher learning rates can lead to oscillations or instability in expert load distribution.*

## Practical Recommendations

Finding the optimal combination of optimizer settings and hyperparameters for MoE training is typically an empirical process.

1. **Start with Defaults:** Begin with standard AdamW settings ($\beta_1=0.9, \beta_2=0.999$), a well-established learning rate schedule (warmup + decay), and a moderate weight decay (e.g., 0.01 or 0.1).
2. **Tune Learning Rate and $\alpha$ Jointly:** The peak learning rate and the auxiliary loss coefficient $\alpha$ are often the most sensitive parameters.
Perform sweeps or experiments varying these together, monitoring both the task loss and the load balancing metrics.
3. **Monitor Metrics:** Closely track the load balancing metric (e.g., coefficient of variation of expert loads), router entropy/confidence, and the fraction of dropped tokens throughout training. Instability in these often points towards suboptimal hyperparameters.
4. **Use Gradient Clipping:** Implement gradient norm clipping as a safety measure against exploding gradients.
5. **Consider Longer Warmup:** If initial stability is a problem, try extending the learning rate warmup phase.

In summary, while sophisticated architectural choices and auxiliary losses are fundamental to MoE success, they must be paired with careful selection and tuning of the optimizer and its hyperparameters. The learning rate, its schedule, and its interaction with the load balancing coefficient are particularly significant factors influencing router stability, expert utilization, and overall convergence.