Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduces the sparsely-gated Mixture-of-Experts layer, detailing the gating network, Top-k routing, and weighted output combination, forming the basis for modern MoE architectures (see the sketch following these entries).
Adaptive Mixtures of Local Experts, Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, Geoffrey E. Hinton, 1991. Neural Computation, Vol. 3, No. 1 (MIT Press). DOI: 10.1162/neco.1991.3.1.79 - This seminal work introduces the mixture-of-experts framework, laying much of the theoretical groundwork for combining multiple specialized networks with a gating network that weights their outputs per input.
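The core mechanism these two papers describe can be summarized compactly: a gating network scores the experts for each input, only the Top-k experts are evaluated (k = all experts recovers the dense 1991 mixture), and their outputs are combined with the renormalized gate weights. The sketch below is a minimal, illustrative NumPy forward pass under assumed shapes and a plain linear gate; it omits the noisy gating and load-balancing losses of Shazeer et al. (2017) and is not the authors' reference implementation.

```python
# Minimal sketch of a sparsely-gated Mixture-of-Experts forward pass.
# Shapes, initialization, and the simple linear gate are illustrative
# assumptions; the 2017 paper additionally uses noisy gating and
# auxiliary load-balancing losses, omitted here.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, k = 8, 16, 4, 2

# Each expert is a small two-layer feed-forward network.
experts = [
    (rng.standard_normal((d_model, d_hidden)) * 0.1,
     rng.standard_normal((d_hidden, d_model)) * 0.1)
    for _ in range(n_experts)
]
# Gating network: a single linear map producing one logit per expert.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def moe_forward(x):
    """x: (batch, d_model) -> (batch, d_model) via Top-k expert mixture."""
    logits = x @ W_gate                          # (batch, n_experts) gating logits
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k largest logits per token

    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        sel = topk[i]
        # Renormalize gate values over the selected experts only
        # (equivalent to softmax with non-Top-k logits set to -inf).
        gates = softmax(logits[i, sel])
        for g, e_idx in zip(gates, sel):
            W1, W2 = experts[e_idx]
            h = np.maximum(x[i] @ W1, 0.0)       # ReLU expert MLP
            out[i] += g * (h @ W2)               # weighted combination of expert outputs
    return out


x = rng.standard_normal((3, d_model))
print(moe_forward(x).shape)  # (3, 8)
```

Setting k = n_experts in this sketch yields the dense, softly-gated mixture of Jacobs et al. (1991), where every expert contributes to every input; the sparse Top-k variant evaluates only a few experts per input, which is what makes very large expert counts computationally practical.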