Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. 2017. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1701.06538. Foundational paper introducing the sparsely-gated Mixture-of-Experts (MoE) layer; details the concept of expert capacity and the auxiliary load-balancing loss used to ensure even token distribution across experts.
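As a quick illustration of the kind of auxiliary load-balancing term this paper introduces, the minimal sketch below computes a squared-coefficient-of-variation penalty over per-expert "importance" (the total gate probability each expert receives). The function name, tensor shapes, and the weight value are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def load_balancing_loss(gate_probs: torch.Tensor, w_importance: float = 0.01) -> torch.Tensor:
    """Sketch of a CV^2-style auxiliary loss in the spirit of Shazeer et al. (2017).

    gate_probs: [num_tokens, num_experts] gate outputs (illustrative shape).
    """
    # "Importance" of each expert: total gate probability routed to it.
    importance = gate_probs.sum(dim=0)  # [num_experts]
    # Squared coefficient of variation; minimizing it pushes experts toward
    # receiving roughly equal importance, i.e. a balanced load.
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return w_importance * cv_squared
```

In training, this term would simply be added to the task loss so the gating network is penalized for concentrating traffic on a few experts.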
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. 2021. International Conference on Learning Representations (ICLR), OpenReview. DOI: 10.5555/3524938.3525287. Describes Google's large-scale implementation of MoE, with practical details on managing expert capacity, routing tokens efficiently, and limiting token dropping in real-world deployments.
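To make the expert-capacity and token-dropping ideas concrete, here is a toy top-1 routing sketch under a GShard-style per-expert capacity. The function name, the capacity_factor value, and the position-based priority are assumptions for illustration; a real implementation operates on batched tensors across devices rather than a Python loop.

```python
import torch


def route_with_capacity(expert_ids: torch.Tensor, num_experts: int,
                        capacity_factor: float = 1.25) -> torch.Tensor:
    """Toy top-1 routing with a per-expert capacity limit (illustrative).

    expert_ids: [num_tokens] chosen expert index for each token.
    Returns a boolean mask of tokens that fit within capacity; overflow
    tokens are "dropped" (in practice they pass through the residual path).
    """
    num_tokens = expert_ids.shape[0]
    # Each expert processes at most `capacity` tokens per batch.
    capacity = int(capacity_factor * num_tokens / num_experts)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    counts = torch.zeros(num_experts, dtype=torch.long)
    for t in range(num_tokens):  # earlier tokens get priority in this sketch
        e = int(expert_ids[t])
        if counts[e] < capacity:
            keep[t] = True
            counts[e] += 1
    return keep
```

Raising capacity_factor reduces dropped tokens at the cost of more padding and compute, which is the practical trade-off the paper discusses.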