Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). - Introduces the Transformer architecture, establishing the dense feed-forward network baseline against which sparse activation models are compared.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017. International Conference on Learning Representations. DOI: 10.48550/arXiv.1701.06538 - Presents the modern formulation of Mixture of Experts layers within deep learning, detailing sparse activation and its benefit of scaling parameter count without increasing per-token computation.
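The scaling benefit described in this entry, more parameters without more per-token compute, is easiest to see in code. The following is a minimal sketch of a top-k gated MoE layer in PyTorch; the class, method, and dimension names are illustrative assumptions, not the paper's released implementation. Only k of the E expert feed-forward networks run for any given token, so FLOPs per token stay roughly constant as E grows.

```python
# Minimal sketch of a sparsely-gated MoE layer (illustrative, not the paper's code):
# a learned gate scores all experts, keeps the top-k per token, and mixes their
# outputs with renormalised gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent two-layer feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network produces one score per expert per token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                # (tokens, experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only k experts per token
        weights = F.softmax(top_vals, dim=-1)                # renormalise over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Select the tokens (and their slot among the k choices) routed to expert e.
            tok, slot = (top_idx == e).nonzero(as_tuple=True)
            if tok.numel() == 0:
                continue
            out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out


# Example: 8 experts in the layer, but only 2 are active for each token.
layer = SparseMoELayer(d_model=64, d_hidden=256, num_experts=8, top_k=2)
y = layer(torch.randn(10, 64))
```

The loop over experts keeps the sketch readable; production implementations instead gather tokens into per-expert batches so each expert runs as one dense matrix multiply.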
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, 2020. International Conference on Learning Representations. DOI: 10.48550/arXiv.2006.16668 - Explores the practical aspects of training large-scale Mixture of Experts models, including distributed training strategies, memory management, and communication overheads.
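A key piece of machinery behind the distributed training this entry mentions is capping each expert at a fixed token capacity so per-device work and all-to-all communication stay bounded. The sketch below illustrates that bookkeeping only; the function name, signature, and capacity_factor default are assumptions for illustration, not GShard's actual implementation.

```python
# Rough sketch of capacity-constrained token dispatch (illustrative, not GShard's code):
# each expert accepts at most capacity = capacity_factor * tokens / num_experts tokens,
# and overflow tokens are dropped (typically passed through unchanged by the MoE layer).
import torch
import torch.nn.functional as F


def dispatch_with_capacity(expert_idx: torch.Tensor, num_experts: int,
                           capacity_factor: float = 1.25):
    """expert_idx: (tokens,) top-1 expert assignment per token.
    Returns a boolean keep-mask and each token's slot within its expert's buffer."""
    num_tokens = expert_idx.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)
    # Position of each token within its expert's queue, in arrival order (0-based).
    one_hot = F.one_hot(expert_idx, num_experts)                        # (tokens, experts)
    position_in_expert = (one_hot.cumsum(dim=0) - 1)[torch.arange(num_tokens), expert_idx]
    keep = position_in_expert < capacity  # tokens beyond capacity are dropped
    return keep, position_in_expert


# Example: 16 tokens routed among 4 experts, so each expert keeps at most 5 tokens here.
idx = torch.randint(0, 4, (16,))
keep, pos = dispatch_with_capacity(idx, num_experts=4)
```

Because every expert buffer has the same fixed size, the per-device tensors have static shapes, which is what makes the all-to-all exchange and automatic sharding tractable on accelerators.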