Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017. arXiv preprint arXiv:1701.06538. DOI: 10.48550/arXiv.1701.06538 - This seminal paper introduced the modern sparsely-gated Mixture-of-Experts layer, detailing the use of a linear gating network with top-k selection and noise for load balancing, which forms the basis for linear routers.
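The following is a minimal sketch of the noisy top-k gating idea described in that paper, assuming PyTorch; the class name `NoisyTopKRouter` and its parameters are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Linear gating network with top-k selection and tunable noise (a sketch)."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # Linear gating network: one logit per expert.
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        # Learned per-expert noise scale, used during training to encourage load balancing.
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean_logits = self.w_gate(x)                          # (batch, num_experts)
        if self.training:
            noise_std = F.softplus(self.w_noise(x))
            logits = clean_logits + torch.randn_like(clean_logits) * noise_std
        else:
            logits = clean_logits
        # Keep only the top-k logits; every other expert receives zero gate weight.
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_vals)
        return F.softmax(masked, dim=-1)                       # sparse gate weights
```

Because only k experts receive non-zero weight per input, only those experts need to be evaluated, which is what makes the layer's computation sparse despite a very large total parameter count.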
Attention-based Experts Selection for Deep Neural Networks, Jung-Min Kim, Jong-Seok Lee, 2020. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (Association for the Advancement of Artificial Intelligence, AAAI). DOI: 10.1609/aaai.v34i04.5879 - This paper proposes an attention-based mechanism for selecting experts, where attention weights are learned to assign experts for each input, directly illustrating the concept of attention-based routers.
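The sketch below illustrates the general idea of an attention-based router, assuming PyTorch; the learnable expert keys and the scaled dot-product scoring are assumptions for illustration, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRouter(nn.Module):
    """Expert selection via attention between an input query and expert keys (a sketch)."""

    def __init__(self, d_model: int, num_experts: int, d_key: int = 64):
        super().__init__()
        # Each expert is represented by a learnable key vector.
        self.expert_keys = nn.Parameter(torch.randn(num_experts, d_key))
        # The input is projected to a query that attends over the expert keys.
        self.query_proj = nn.Linear(d_model, d_key)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(x)                                     # (batch, d_key)
        scores = q @ self.expert_keys.t() / math.sqrt(q.size(-1))  # scaled dot-product
        return F.softmax(scores, dim=-1)                           # attention weight per expert
```

In contrast to the linear router above, the routing decision here comes from a learned similarity between the input and per-expert key vectors rather than from a single linear projection of the input.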