Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017, arXiv, DOI: 10.48550/arXiv.1701.06538 - Introduces the sparsely-gated Mixture-of-Experts (MoE) layer and defines the router (gating network) that directs each token to a small subset of experts. This paper is the foundational reference for the MoE architecture.
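To make the router's role concrete, below is a minimal NumPy sketch of top-k gating in the spirit of Shazeer et al. (2017): gate logits are computed per token, only the k largest are kept, and a softmax over those produces sparse mixture weights. The paper's noise term and load-balancing losses are omitted, and the names (`top_k_gating`, `w_gate`) are illustrative, not from the paper's code.

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Sparse routing sketch: compute gate logits, keep the top-k per token,
    softmax over the kept logits, zero weight elsewhere."""
    logits = x @ w_gate                                   # [num_tokens, num_experts]
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, topk_idx,
                      np.take_along_axis(logits, topk_idx, axis=-1), axis=-1)
    # softmax over the masked logits; the -inf entries become exactly zero
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    gates = exp / exp.sum(axis=-1, keepdims=True)
    return gates, topk_idx

# Toy usage: 4 tokens, model dim 8, 4 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_gate = rng.normal(size=(8, 4))
gates, idx = top_k_gating(x, w_gate)
print(gates.round(3))      # each row has exactly two nonzero weights summing to 1
```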
NVIDIA TensorRT Developer Guide, NVIDIA Corporation, 2023 - A guide to optimizing deep learning inference through methods such as quantization, layer fusion, and kernel optimization, applicable to router components for performance gains.
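As a rough illustration of how such optimizations are applied (not an excerpt from the guide), the sketch below builds a reduced-precision TensorRT engine from an ONNX export of a router subgraph using the Python API. The file name `router.onnx` is hypothetical, and the calls assume a TensorRT 8.x-era installation.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Parse a (hypothetical) ONNX export of the router/gating subgraph.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("router.onnx", "rb") as f:          # hypothetical file name
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

# Allow FP16 so the builder may select reduced-precision, fused kernels.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Layer fusion and kernel selection happen during the engine build.
serialized_engine = builder.build_serialized_network(network, config)
with open("router.plan", "wb") as f:
    f.write(serialized_engine)
```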
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen, 2020, arXiv preprint arXiv:2006.16668, DOI: 10.48550/arXiv.2006.16668 - Discusses distributed training and inference of large MoE models, detailing the expert-parallelism and communication patterns (notably all-to-all token exchange) that shape router deployment.
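To make that communication pattern concrete, here is a small self-contained Python simulation of the all-to-all token dispatch behind expert parallelism. No distributed runtime is used; the round-robin expert-to-device placement and the function name are assumptions for illustration only.

```python
def simulate_all_to_all_dispatch(tokens_per_device, expert_assignment, num_devices):
    """Simulate the all-to-all exchange behind expert parallelism: each device
    buckets its tokens by the device hosting the chosen expert, then every
    device receives one bucket from every peer."""
    # send[src][dst] = tokens on device `src` whose assigned expert lives on device `dst`
    send = [[[] for _ in range(num_devices)] for _ in range(num_devices)]
    for src in range(num_devices):
        for token, expert in zip(tokens_per_device[src], expert_assignment[src]):
            dst = expert % num_devices          # assumed round-robin expert placement
            send[src][dst].append(token)
    # The all-to-all step: device `dst` gathers its bucket from every source device.
    return [[send[src][dst] for src in range(num_devices)] for dst in range(num_devices)]

# Toy usage: 2 devices, 4 experts, 3 tokens per device labeled by (device, index).
tokens = [[("d0", i) for i in range(3)], [("d1", i) for i in range(3)]]
experts = [[0, 3, 2], [1, 0, 3]]               # expert chosen by the router for each token
received = simulate_all_to_all_dispatch(tokens, experts, num_devices=2)
for dst, buckets in enumerate(received):
    print(f"device {dst} receives:", buckets)
```

Because every routing decision implies a cross-device exchange of this kind, the router's placement and batching choices directly determine the communication volume discussed in the paper.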