Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017. International Conference on Learning Representations (ICLR 2017). Introduces the sparsely-gated Mixture-of-Experts (MoE) layer, demonstrating how it enables a large increase in model capacity without a proportional increase in computational cost per token (see the gating sketch after this list).
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, and Noam Shazeer, 2022. Journal of Machine Learning Research, Vol. 23. Presents Switch Transformers, which simplify the MoE layer to a single active expert per token (k = 1), enabling models with over a trillion parameters while addressing the attendant communication and load-balancing challenges (see the top-1 routing sketch after this list).
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Quoc V. Le, and Zhifeng Chen, 2022. arXiv preprint arXiv:2201.05824, DOI: 10.48550/arXiv.2201.05824. Details GLaM, an MoE architecture that achieves high performance with significantly fewer training FLOPs than dense models of similar quality, emphasizing efficiency and scaling.
A Survey of Mixture of Experts, Xufeng Lin, Yiming Qian, Yuanyang Liu, Huadong Liu, Xizhen Sun, Jianyang Li, Guanyu Chen, Qingyu Jin, Meng Zhang, and Bo Xu, 2023. arXiv preprint arXiv:2308.14073, DOI: 10.48550/arXiv.2308.14073. Provides a comprehensive overview of Mixture-of-Experts models, covering their history, architectural variations, training techniques, and practical considerations such as efficiency and load balancing.
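The mechanism summarized in the Shazeer et al. (2017) entry can be illustrated with a minimal sketch: a gating network scores all experts, only the top-k experts run for each token, and their outputs are combined with renormalized gate weights, so capacity grows with the number of experts while per-token compute stays roughly constant. This is an illustrative simplification assuming PyTorch and small two-layer feed-forward experts; the names SparseMoE, d_model, d_hidden, and k are hypothetical, and the paper's noisy gating and load-balancing terms are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparsely-gated MoE layer: only the top-k experts run per token."""

    def __init__(self, d_model, d_hidden, num_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.gate(x)                   # (num_tokens, num_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The nested loop makes the routing explicit for readability; production implementations instead dispatch tokens to experts in batched, parallel form.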
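Likewise, a hedged sketch of the k = 1 routing summarized in the Switch Transformers entry, again assuming PyTorch: each token is sent to its single highest-probability expert, and an auxiliary load-balancing loss of the form N * sum_i f_i * P_i (fraction of tokens dispatched to expert i times mean router probability for expert i) discourages collapse onto a few experts. The function name switch_route is hypothetical, and the capacity factor, token dropping, and the loss-scaling coefficient from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def switch_route(gate_logits, num_experts):
    """Top-1 (k=1) routing with an auxiliary load-balancing loss.

    gate_logits: (num_tokens, num_experts) raw router scores.
    Returns the chosen expert index and gate value per token, plus the loss term.
    """
    probs = F.softmax(gate_logits, dim=-1)        # router probabilities
    gate_vals, expert_idx = probs.max(dim=-1)     # each token picks one expert
    # f_i: fraction of tokens dispatched to each expert
    dispatch_frac = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to each expert
    router_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_frac * router_prob)
    return expert_idx, gate_vals, aux_loss
```

The loss is minimized when tokens and router probability are spread uniformly across experts, which is what keeps any single expert from becoming a communication or compute bottleneck.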