Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1701.06538 - A foundational work that introduced sparsely gated Mixture-of-Experts layers to deep learning; it discusses the challenges of training MoE models and highlights the importance of expert utilization and balanced routing.
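
The routing mechanism this paper introduces is noisy top-k gating: a learned gate adds tunable noise to per-expert logits, keeps only the k largest, and softmaxes the survivors so that only a few experts are active per token. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; the weight names `w_gate` and `w_noise` and the single-token interface are assumptions for clarity.

```python
import numpy as np

def noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=None):
    """Sparse gate values over experts for a single token x (illustrative sketch).

    x       : (d_model,) input vector
    w_gate  : (d_model, n_experts) gating weights       (assumed name)
    w_noise : (d_model, n_experts) noise-scale weights  (assumed name)
    """
    rng = np.random.default_rng() if rng is None else rng
    clean_logits = x @ w_gate
    noise_scale = np.log1p(np.exp(x @ w_noise))  # softplus keeps the noise scale positive
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_scale

    # Keep only the top-k logits; the rest get -inf so their gate value becomes 0.
    top_k = np.argsort(noisy_logits)[-k:]
    masked = np.full_like(noisy_logits, -np.inf)
    masked[top_k] = noisy_logits[top_k]

    # Softmax over the surviving logits yields a sparse distribution over experts.
    exp = np.exp(masked - masked[top_k].max())
    return exp / exp.sum()
```

Because all but k gate values are exactly zero, only the selected experts need to be evaluated for that token, which is what makes the conditional computation cheap.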
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen, 2021. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2006.16668 - Presents an early large-scale MoE system and introduces several techniques for training conditional computation models, including an auxiliary loss for load balancing and automatic sharding for efficient distributed training.
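
The load-balancing idea can be summarized with a short sketch: pair the fraction of tokens dispatched to each expert (non-differentiable) with the mean gate probability per expert (differentiable), so that minimizing their product pushes routing toward uniform expert utilization. This is a hedged approximation of the loss family GShard describes, not the paper's exact formulation or code; the function and variable names are illustrative.

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_assignment, n_experts):
    """Auxiliary load-balancing loss (sketch).

    gate_probs        : (n_tokens, n_experts) softmax gate outputs
    expert_assignment : (n_tokens,) index of the expert each token was routed to
    """
    n_tokens = gate_probs.shape[0]
    # Fraction of tokens dispatched to each expert (acts as a stop-gradient term).
    tokens_per_expert = np.bincount(expert_assignment, minlength=n_experts) / n_tokens
    # Mean gate probability per expert (the differentiable term the gradient flows through).
    mean_gate_per_expert = gate_probs.mean(axis=0)
    # Scaled so a perfectly uniform routing evaluates to 1, which is convenient to monitor.
    return n_experts * np.dot(tokens_per_expert, mean_gate_per_expert)
```

In training, a loss of this form is typically added to the task loss with a small coefficient so that balanced routing is encouraged without overriding the primary objective.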