Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. International Conference on Learning Representations (ICLR), 2017. DOI: 10.48550/arXiv.1701.06538 - This paper introduces the sparsely-gated Mixture-of-Experts layer, describing the foundational concepts of expert routing and the initial top-k selection mechanism, which is central to this section.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. International Conference on Learning Representations (ICLR), 2020. DOI: 10.48550/arXiv.2006.16668 - This work presents the GShard architecture, elaborating on the noisy top-k gating strategy to improve expert load balancing and training stability in large MoE models, directly supporting the "Noisy Top-k Gating" sub-section.