Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - This foundational paper introduces the sparsely-gated Mixture-of-Experts (MoE) layer, providing context by contrasting it with dense (soft) gating and discussing the computational advantages of sparsity.
Learning to Route: A Differentiable Approach to Mixture of Experts, Clemens Rosenbaum, Chetan Sanan, Charith Gunasekara, Josh Trani, Andrew Gordon Wilson, Kyunghyun Cho, 2018. Proceedings of the 35th International Conference on Machine Learning (ICML), Vol. 80 (PMLR). DOI: 10.5555/3326938.3326955 - This paper presents a methodology for achieving differentiable routing in Mixture of Experts models, aligning directly with the section's discussion of soft routing.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS) 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its self-attention mechanism, whose softmax-weighted sum serves as a useful analogy for the weighted-sum computation in soft routing (see the sketch following this list).
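
To make the contrast drawn in these references concrete, here is a minimal sketch of the two gating schemes: soft (dense) routing, which blends the outputs of all experts with softmax weights, versus sparse top-k gating, which evaluates only the k highest-scoring experts. This is not code from any of the cited papers; the names (soft_route, sparse_route, num_experts, k) and the use of plain linear maps as "experts" are illustrative assumptions.

```python
# Minimal sketch, assuming linear "experts" and a linear gating network.
# Not taken from the cited papers; names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_in, d_out, k = 4, 8, 8, 2

# Each expert is just a weight matrix here.
experts = [rng.normal(size=(d_in, d_out)) for _ in range(num_experts)]
gate_weights = rng.normal(size=(d_in, num_experts))

def softmax(z):
    z = z - z.max()            # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_route(x):
    """Dense (soft) gating: every expert runs; outputs are blended by gate probabilities."""
    g = softmax(x @ gate_weights)                       # one weight per expert
    return sum(g[i] * (x @ experts[i]) for i in range(num_experts))

def sparse_route(x):
    """Sparse top-k gating: only the k highest-scoring experts are evaluated."""
    logits = x @ gate_weights
    topk = np.argsort(logits)[-k:]                      # indices of the k largest gate logits
    g = softmax(logits[topk])                           # renormalize over the selected experts
    return sum(g[j] * (x @ experts[i]) for j, i in enumerate(topk))

x = rng.normal(size=d_in)
print(soft_route(x).shape, sparse_route(x).shape)       # both produce a (d_out,) vector
```

The soft variant is fully differentiable but pays the compute cost of every expert on every input, which is the computational motivation for sparsity discussed in the first reference; the weighted sum it computes is structurally the same softmax-weighted combination used by self-attention in the third reference.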