Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduces the sparsely-gated Mixture-of-Experts layer, demonstrating how conditional computation can scale neural network capacity without a proportional increase in compute, since only a few experts are evaluated per input.
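To make the routing idea concrete, here is a minimal sketch of a top-k sparsely-gated MoE layer in PyTorch. The module name, dimensions, and the plain (noise-free) gating are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of top-k sparse gating: a gating network scores all experts,
# but only the k highest-scoring experts are evaluated for each token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward network (illustrative sizes).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gating network produces one logit per expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.gate(x)                    # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)   # renormalize over selected experts only
        out = torch.zeros_like(x)
        # Conditional computation: each expert runs only on the tokens routed to it.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(SparseMoE()(x).shape)  # torch.Size([10, 64])
```

Because the gate output is sparse, adding more experts grows the parameter count while the per-token cost stays roughly fixed at k expert evaluations, which is the capacity-scaling argument the paper makes.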
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen, 2020, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2006.16668 - This work presents GShard, a system that scales Mixture-of-Experts models to hundreds of billions of parameters by combining conditional computation with automatic sharding of experts across accelerators.
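One ingredient of GShard-style dispatch is a fixed per-expert capacity, so that the sharded experts receive bounded, balanced workloads. The sketch below illustrates that idea only; the function name, the capacity formula's exact form, and the simple drop-on-overflow policy are assumptions for illustration, not the paper's precise algorithm.

```python
# Illustrative per-expert capacity limit: tokens routed to an already-full
# expert are dropped (marked -1) so every expert's buffer has a fixed size.
import numpy as np

def dispatch_with_capacity(expert_ids, num_experts, capacity_factor=1.25):
    """Assign each token a slot in its chosen expert's buffer, or -1 if it overflows."""
    num_tokens = len(expert_ids)
    # Fixed buffer size per expert, slightly above the perfectly balanced share.
    capacity = int(np.ceil(capacity_factor * num_tokens / num_experts))
    counts = np.zeros(num_experts, dtype=int)
    assignment = np.full(num_tokens, -1, dtype=int)
    for t, e in enumerate(expert_ids):
        if counts[e] < capacity:
            assignment[t] = counts[e]   # slot index inside expert e's buffer
            counts[e] += 1
    return assignment, capacity

expert_ids = np.random.randint(0, 4, size=16)   # gating decisions for 16 tokens, 4 experts
assignment, capacity = dispatch_with_capacity(expert_ids, num_experts=4)
print(capacity, assignment)
```

Fixing the buffer size is what makes the expert computation shape static, which in turn is what lets the layer be sharded automatically across many devices without data-dependent tensor shapes.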