Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, detailing its architecture and demonstrating how it enables training models with vastly more parameters while keeping the computation per example roughly constant, since only a small subset of experts is active for each input.
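To make the core idea concrete, below is a minimal, illustrative sketch of a sparsely-gated MoE layer with top-k routing in the spirit of this paper. The class name, dimensions, and the plain softmax-over-selected-experts gating are assumptions for clarity; the original layer additionally adds noise to the gate and uses auxiliary losses to balance expert load.

```python
# Minimal sketch (not the paper's implementation): top-k gated mixture of experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Router that produces one logit per expert for each token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.ReLU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                              # (tokens, experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(topk_vals, dim=-1)             # renormalize over selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens whose top-k selection includes expert e.
            token_idx, slot_idx = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # expert e receives no tokens this step
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(x[token_idx])
        return out


# Example: route 8 tokens of width 16 through 4 experts, 2 active experts per token.
moe = SparseMoE(d_model=16, d_hidden=32, num_experts=4, k=2)
y = moe(torch.randn(8, 16))
print(y.shape)  # torch.Size([8, 16])
```

Because only k of the experts run for any given token, parameter count grows with the number of experts while per-token compute stays roughly fixed, which is the scaling property the paper exploits.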
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, 2020, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2006.16668 - This paper applies conditional computation, specifically MoE layers, to scale deep learning models to hundreds of billions of parameters, introducing techniques for efficient training and automatic sharding across accelerators.
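One of the training techniques associated with GShard-style MoE models is an auxiliary load-balancing loss that discourages the router from collapsing onto a few experts. The sketch below shows one widely used formulation (fraction of tokens dispatched to each expert multiplied by the mean router probability for that expert); the exact normalization and constants vary across papers and implementations, so treat the function name and scaling here as illustrative assumptions.

```python
# Illustrative auxiliary load-balancing loss for MoE routing (constants vary by implementation).
import torch
import torch.nn.functional as F


def load_balancing_loss(gate_logits: torch.Tensor, dispatch_idx: torch.Tensor) -> torch.Tensor:
    """gate_logits:  (tokens, experts) raw router logits.
    dispatch_idx: (tokens,) index of the expert each token was dispatched to."""
    num_experts = gate_logits.size(-1)
    probs = F.softmax(gate_logits, dim=-1)  # router probabilities per token
    # Fraction of tokens actually dispatched to each expert.
    dispatch_frac = F.one_hot(dispatch_idx, num_experts).float().mean(dim=0)
    # Mean router probability assigned to each expert.
    prob_frac = probs.mean(dim=0)
    # The product is minimized when both distributions are uniform (1 / num_experts each).
    return num_experts * torch.sum(dispatch_frac * prob_frac)


# Example: 16 tokens, 4 experts, each token dispatched to its argmax expert.
logits = torch.randn(16, 4)
loss = load_balancing_loss(logits, logits.argmax(dim=-1))
print(loss.item())
```

In practice this term is added to the task loss with a small coefficient so that expert utilization stays roughly uniform without overriding the router's learned preferences.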