Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduces the sparsely-gated Mixture-of-Experts layer, including noisy top-k gating and the auxiliary load-balancing loss, which are essential for stable router learning.
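
The sketch below is an illustrative, simplified rendering of the two mechanisms this entry highlights, not the authors' released code: noisy top-k gating and an importance-based load-balancing auxiliary loss (the paper combines an importance loss with a separate load loss; only the importance-style term is shown here). PyTorch, the class name `NoisyTopKGate`, and all hyperparameters are assumptions for illustration.

```python
# Minimal sketch of noisy top-k gating with an importance-style
# load-balancing auxiliary loss, assuming PyTorch. Names and the exact
# loss form are illustrative, not the paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # clean routing logits
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # input-dependent noise scale

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        clean_logits = self.w_gate(x)
        # Noisy gating: add Gaussian noise with a learned, input-dependent
        # scale so the router explores experts instead of collapsing early.
        noise_std = F.softplus(self.w_noise(x))
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std

        # Keep only the top-k logits per token; all other experts get zero weight.
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)
        gates = F.softmax(masked, dim=-1)  # sparse: non-zero on only k experts per token

        # Importance-style auxiliary loss: squared coefficient of variation of
        # the total gate mass per expert, pushing tokens to spread across experts.
        importance = gates.sum(dim=0)
        aux_loss = importance.var(unbiased=False) / (importance.mean() ** 2 + 1e-9)
        return gates, aux_loss


# Usage sketch: the auxiliary loss is scaled by a small coefficient and added
# to the task loss during training.
gate = NoisyTopKGate(d_model=64, num_experts=8, k=2)
tokens = torch.randn(32, 64)
gates, aux_loss = gate(tokens)
total_loss_contribution = 0.01 * aux_loss  # coefficient is an assumed hyperparameter
```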