Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - Introduces the sparsely-gated Mixture-of-Experts (MoE) layer for neural networks, enabling models with a vast number of parameters while maintaining a roughly constant computational cost per example. This is foundational to the concept of decoupling parameter count from per-example computation.
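
The mechanism can be sketched in a few lines: a gating network scores all experts for each token, only the top-k scored experts are evaluated, and their outputs are combined using the renormalized gate weights. The PyTorch sketch below is illustrative only, with assumed sizes (d_model, d_hidden, num_experts, k); the paper's actual layer additionally uses noisy top-k gating and auxiliary load-balancing losses, which are omitted here.

```python
# Illustrative top-k sparsely-gated MoE layer (a sketch, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward block; total parameters grow with num_experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gating network produces one score per expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x)                         # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)         # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        # Only the k selected experts are evaluated per token, so per-token compute stays
        # roughly constant no matter how many experts (parameters) the layer holds.
        for e, expert in enumerate(self.experts):
            rows, slots = (top_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out


if __name__ == "__main__":
    layer = SparseMoE()
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because only k of num_experts experts run per token, adding experts grows the parameter count without growing per-token FLOPs, which is the decoupling the entry refers to.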
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - A seminal paper exploring how the performance of dense neural language models scales with model size, dataset size, and computational budget, providing the context for why MoE's parameter efficiency is so beneficial.
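
For reference, the paper fits test loss to power laws in each resource when the other resources are not the bottleneck; the exponents below are the approximate fitted values it reports, and N_c, D_c, C_c denote fitted constants (N = non-embedding parameters, D = dataset tokens, C_min = compute at the optimal model size).

```latex
% Approximate single-variable power-law fits reported by Kaplan et al. (2020).
\begin{align*}
  L(N)          &\approx \left(\frac{N_c}{N}\right)^{\alpha_N},
  & \alpha_N          &\approx 0.076, \\
  L(D)          &\approx \left(\frac{D_c}{D}\right)^{\alpha_D},
  & \alpha_D          &\approx 0.095, \\
  L(C_{\min})   &\approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}},
  & \alpha_C^{\min}   &\approx 0.050.
\end{align*}
```

Since loss keeps improving as a power law in parameter count, an architecture that adds parameters without adding per-example compute, as MoE does, sits in a favorable position under these laws.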