Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduced the sparsely-gated Mixture-of-Experts layer and proposed an auxiliary loss for balancing expert load, directly addressing the problem described.
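To make the auxiliary balancing idea concrete, here is a minimal PyTorch sketch of an importance-style loss in the spirit of this paper: it penalizes the squared coefficient of variation of the total gate probability each expert receives over a batch. The function and variable names (importance_aux_loss, gate_probs, w_importance) are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of an importance-style auxiliary balancing loss
# (in the spirit of Shazeer et al., 2017); names are illustrative.
import torch

def importance_aux_loss(gate_probs: torch.Tensor, w_importance: float = 0.01) -> torch.Tensor:
    """gate_probs: [num_tokens, num_experts] softmax outputs of the gating network.

    Importance of an expert = total gate probability it receives over the batch.
    The loss is the squared coefficient of variation of the importance vector,
    which is minimized when all experts receive equal total weight.
    """
    importance = gate_probs.sum(dim=0)                              # [num_experts]
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return w_importance * cv_squared

# Usage: gate probabilities for a batch of 64 tokens routed over 8 experts.
gate_probs = torch.softmax(torch.randn(64, 8), dim=-1)
print(importance_aux_loss(gate_probs))
```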
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, 2020, arXiv preprint arXiv:2006.16668, DOI: 10.48550/arXiv.2006.16668 - This work demonstrates the practical scaling of MoE models to massive sizes, emphasizing the need for efficient load distribution and automatic sharding strategies, both of which are intrinsically tied to the load-balancing problem.
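As a rough illustration of the kind of load-balancing term used when scaling MoE models this way, below is a sketch that multiplies each expert's hard dispatch share by its mean gate probability, so the term grows when a few experts dominate. The exact scaling constants vary across implementations, and the names (gshard_style_aux_loss, f, p) are assumptions, not the GShard code.

```python
# Hedged sketch of a GShard-style load-balancing term; scaling constants
# and names are illustrative and differ across implementations.
import torch
import torch.nn.functional as F

def gshard_style_aux_loss(gate_logits: torch.Tensor) -> torch.Tensor:
    """gate_logits: [num_tokens, num_experts] raw scores from the router.

    f[e] = fraction of tokens whose top-1 choice is expert e (hard dispatch share).
    p[e] = mean softmax probability assigned to expert e (soft dispatch share).
    The sum of f * p, scaled by num_experts, is ~1 at perfect balance and grows
    as routing concentrates on a few experts, so minimizing it spreads the load.
    """
    num_tokens, num_experts = gate_logits.shape
    probs = F.softmax(gate_logits, dim=-1)
    top1 = probs.argmax(dim=-1)                               # [num_tokens]
    f = F.one_hot(top1, num_experts).float().mean(dim=0)      # [num_experts]
    p = probs.mean(dim=0)                                     # [num_experts]
    return num_experts * torch.sum(f * p)

# Usage: a router over 16 experts for 128 tokens.
print(gshard_style_aux_loss(torch.randn(128, 16)))
```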
Router Augmentation for Mixture-of-Experts, Koustuv Sinha, Michael Noukhovitch, Subhabrata Roy, Karthik Srinivasan, William Fedus, Michael Ryoo, and Yoshua Bengio, 2022, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2202.04944 - This paper proposes methods to improve the gating network's routing decisions, making routing more robust and thereby contributing to better load balance and expert specialization.
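As a generic illustration of one way routing decisions can be made more robust, below is a sketch of noisy top-k gating, an idea from the Shazeer et al. paper above; it is not claimed to be this paper's specific method, and all names (noisy_top_k_gate, w_gate, w_noise) are illustrative.

```python
# Hedged sketch of noisy top-k gating as a generic robust-routing technique
# (following Shazeer et al., 2017); not the method of the paper cited above.
import torch

def noisy_top_k_gate(x: torch.Tensor, w_gate: torch.Tensor, w_noise: torch.Tensor, k: int = 2) -> torch.Tensor:
    """x: [num_tokens, d_model]; w_gate, w_noise: [d_model, num_experts].

    Adds learned, input-dependent Gaussian noise to the routing logits before
    selecting the top-k experts, which perturbs borderline routing decisions and
    tends to spread tokens more evenly across experts during training.
    """
    clean_logits = x @ w_gate
    noise_std = torch.nn.functional.softplus(x @ w_noise)
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    top_vals, top_idx = noisy_logits.topk(k, dim=-1)
    # Softmax only over the selected experts; non-selected gates stay exactly zero.
    gates = torch.zeros_like(noisy_logits)
    gates.scatter_(-1, top_idx, torch.softmax(top_vals, dim=-1))
    return gates  # [num_tokens, num_experts], sparse row-wise

# Usage: 32 tokens of width 64 routed over 8 experts.
x = torch.randn(32, 64)
print(noisy_top_k_gate(x, torch.randn(64, 8), torch.randn(64, 8)).shape)
```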