Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1701.06538 - Foundational work introducing the sparsely-gated Mixture-of-Experts layer, arguing that conditional computation (sparsity) is needed to scale model capacity and noting its computational and memory implications.
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training for Large Language Models, Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, 2022. Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Vol. 162 (PMLR). DOI: 10.48550/arXiv.2201.05596 - Describes specific inference challenges of MoE models and presents solutions for optimizing latency, throughput, and memory utilization, including techniques for load balancing and communication.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, Claire Cui, 2022. International Conference on Machine Learning (ICML 2022). DOI: 10.48550/arXiv.2112.06905 - Presents a large-scale MoE language model and discusses the practical aspects of its efficiency during inference, including memory and computation considerations at scale.