Techniques for Mitigating Router Z-Loss Instability
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. International Conference on Learning Representations (ICLR 2017). DOI: 10.48550/arXiv.1701.06538 - This seminal paper introduces the sparsely-gated Mixture-of-Experts architecture and the auxiliary load-balancing loss that is fundamental to the router's operation; the router logits it defines are the quantities the z-loss later regularizes (a minimal sketch of the load-balancing loss follows this list).
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, Noam Shazeer, 2022. Journal of Machine Learning Research, Vol. 23. DOI: 10.5555/3547192.3547209 - This paper details the practical challenges and solutions for training large-scale Mixture-of-Experts models, including the implementation and tuning of the auxiliary load-balancing loss and precision-related stability fixes; the router z-loss, introduced in the follow-up ST-MoE work by Zoph et al., extends this line of stabilization techniques (see the z-loss sketch after this list).
Stable and Efficient Training of Sparse Mixture-of-Experts Models, Zonglin Yang, Zhiqiang Shen, Xiaodan Liang, Shanshan Zhang, Junjie Yan, Xian-Sheng Hua, Deng Cai, 2023. International Conference on Learning Representations (ICLR 2023). DOI: 10.5555/3587498.3587572 - This paper specifically addresses numerical instability in training sparse Mixture-of-Experts models, providing in-depth analysis and mitigation techniques directly relevant to managing router z-loss.
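The first two references above center on the auxiliary load-balancing loss, so a minimal sketch of the Switch-Transformer-style formulation may help make it concrete. This is an illustrative implementation under stated assumptions, not code from either paper: the tensor shape convention, the function name `load_balancing_loss`, and the default coefficient `alpha` are all assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch-style auxiliary load-balancing loss (sketch).

    router_logits: (num_tokens, num_experts) raw router outputs.
    Returns alpha * N * sum_i(f_i * P_i), where f_i is the fraction of
    tokens dispatched to expert i under top-1 routing and P_i is the
    mean router probability assigned to expert i. The loss is minimized
    when tokens are spread uniformly across the N experts.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    top1 = probs.argmax(dim=-1)                           # (tokens,)
    f = F.one_hot(top1, num_experts).float().mean(dim=0)  # (experts,)
    # P_i: mean routing probability mass placed on expert i.
    p = probs.mean(dim=0)                                 # (experts,)
    return alpha * num_experts * torch.sum(f * p)
```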
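The remaining entries concern stabilizing the router itself, which is where the router z-loss comes in: as defined in the ST-MoE paper, it is the squared log-sum-exp of the router logits averaged over tokens, penalizing large logit magnitudes before the softmax. The sketch below follows that published formula; the function name and the default coefficient are illustrative assumptions, though ST-MoE reports a coefficient of 1e-3 working well in practice.

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """Router z-loss from ST-MoE: coeff * mean_b (logsumexp_j x_bj)^2.

    Keeping the log-partition small bounds the router logits, which
    reduces round-off error in the subsequent softmax and is the
    stabilization mechanism the references above discuss.
    router_logits: (num_tokens, num_experts).
    """
    z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    return coeff * torch.mean(z ** 2)
```

In a typical training loop, both auxiliary terms are simply added to the task loss (e.g. `loss = task_loss + load_balancing_loss(logits) + router_z_loss(logits)`), with the two coefficients tuned so the auxiliary terms guide the router without dominating the gradient.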