GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen, 2020. arXiv preprint arXiv:2006.16668, DOI: 10.48550/arXiv.2006.16668 - Describes a system for automatically sharding large models, including their expert layers, across multiple devices, and details the all-to-all communication pattern essential for expert parallelism (a minimal sketch of that dispatch-and-combine pattern appears after this list).
Mixture of Experts (MoE) Layer, DeepSpeed Team (Microsoft), 2024 - Official DeepSpeed documentation on implementing Mixture of Experts layers, covering the framework's approach to distributed training for MoE models.
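For context on the mechanism both references center on, here is a minimal, self-contained sketch of the dispatch-and-combine pattern behind expert parallelism. This is plain NumPy on a single process, not GShard's or DeepSpeed's actual code; the worker count, token counts, and variable names (`W`, `tokens_per_worker`, `d_model`) are illustrative assumptions, and real systems replace the buffer transposition with a collective all-to-all and add gating weights and expert capacity limits.

```python
import numpy as np

# Conceptual sketch of expert parallelism (not GShard's or DeepSpeed's code):
# each of W workers holds a batch of tokens and a router assignment, and
# expert e lives on worker e. An all-to-all exchange sends every token to the
# worker hosting its assigned expert; a second all-to-all returns the outputs.

W = 4                     # workers == experts in this toy setup
tokens_per_worker = 8
d_model = 16
rng = np.random.default_rng(0)

# Per-worker token activations and router decisions (which expert each token wants).
tokens = [rng.normal(size=(tokens_per_worker, d_model)) for _ in range(W)]
assignments = [rng.integers(0, W, size=tokens_per_worker) for _ in range(W)]

# Step 1: each worker buckets its tokens by destination expert.
send_buffers = [[tokens[w][assignments[w] == e] for e in range(W)] for w in range(W)]

# Step 2: all-to-all -- worker e receives the bucket destined for expert e
# from every worker (simulated here by transposing the buffer grid).
recv_buffers = [[send_buffers[w][e] for w in range(W)] for e in range(W)]

# Step 3: each worker applies its local expert to everything it received.
expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(W)]
expert_outputs = [[chunk @ expert_weights[e] for chunk in recv_buffers[e]] for e in range(W)]

# Step 4: a second all-to-all returns outputs to the tokens' home workers,
# where they are scattered back into the original token order.
for w in range(W):
    out = np.empty_like(tokens[w])
    for e in range(W):
        out[assignments[w] == e] = expert_outputs[e][w]
    print(f"worker {w}: restored output shape {out.shape}")
```

On real hardware, steps 2 and 4 are each a single collective communication call across devices (the all-to-all the GShard paper describes), which is what makes the cost of expert parallelism scale with the number of tokens exchanged rather than with the total number of experts.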