Adaptive Mixtures of Local Experts, Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton, 1991. Neural Computation, Vol. 3, No. 1, pp. 79-87 (MIT Press). DOI: 10.1162/neco.1991.3.1.79 - Introduces the foundational Mixture of Experts architecture, in which separate 'expert' networks specialize in different regions of the input space and a gating network learns which expert to trust for each input.
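To make the idea concrete, here is a minimal sketch of a dense mixture of experts with a softmax gating network. It is not the paper's original formulation; the use of PyTorch, the layer sizes, and the expert architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int, hidden: int = 32):
        super().__init__()
        # Each expert is a small network that can specialize on part of the input space.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        )
        # The gating network maps the input to mixing weights over the experts.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, out_dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # gate-weighted blend


moe = MixtureOfExperts(in_dim=8, out_dim=2, num_experts=4)
y = moe(torch.randn(16, 8))  # -> shape (16, 2)
```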
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - Presents the sparsely-gated Mixture-of-Experts layer, in which a noisy top-k gating network activates only a few experts per example, allowing networks with billions of parameters to be trained at roughly constant computational cost per example; the paper demonstrates the layer between stacked LSTM layers on language modeling and machine translation tasks.
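The sketch below illustrates the core routing idea in the spirit of this paper: only the k highest-scoring experts run for each example. It omits the paper's noise term and load-balancing losses, and the expert widths and PyTorch implementation are assumptions for illustration.

```python
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                                # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)    # keep only the k best experts
        weights = torch.softmax(topk_vals, dim=-1)           # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                # examples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = SparseMoELayer(dim=16, num_experts=8, k=2)
y = layer(torch.randn(4, 16))  # -> shape (4, 16); only 2 of 8 experts run per example
```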
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, 2021. International Conference on Learning Representations (ICLR) 2021. DOI: 10.48550/arXiv.2006.16668 - Details a system for automatically sharding and training giant conditional-computation models, including sparse MoE Transformers, across thousands of accelerators, scaling a multilingual translation model beyond 600 billion parameters on 2048 TPU v3 cores.
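A simplified, single-device sketch of GShard-style routing is shown below: top-2 gating with a fixed per-expert capacity, so the dispatch buffers have static shapes that can be sharded across accelerators. The all-to-all exchange that GShard performs between devices is only indicated in comments, and the capacity value is an illustrative assumption.

```python
import torch


def top2_dispatch(logits: torch.Tensor, capacity: int):
    """Route each token to its top-2 experts, dropping tokens that overflow
    an expert's capacity. Returns gate weights, chosen experts, and slots."""
    num_tokens, num_experts = logits.shape
    probs = torch.softmax(logits, dim=-1)
    top2_p, top2_e = probs.topk(2, dim=-1)                   # (tokens, 2)
    weights = top2_p / top2_p.sum(dim=-1, keepdim=True)      # renormalize over the pair

    fill = torch.zeros(num_experts, dtype=torch.long)        # how full each expert is
    assign = torch.full((num_tokens, 2), -1, dtype=torch.long)  # capacity slot, or -1 = dropped
    for t in range(num_tokens):
        for choice in range(2):
            e = int(top2_e[t, choice])
            if fill[e] < capacity:                           # token fits: give it a slot
                assign[t, choice] = fill[e]
                fill[e] += 1
    # In GShard, tokens would now be gathered into an (experts, capacity, dim) buffer,
    # exchanged between devices with an all-to-all, processed by the local expert FFNs,
    # and combined back using `weights`.
    return weights, top2_e, assign


w, experts, slots = top2_dispatch(torch.randn(8, 4), capacity=4)
```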