Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, Noam Shazeer, 2022, Journal of Machine Learning Research, Vol. 23 - This work presents the Switch Transformer architecture, which directly addresses communication overhead and load imbalance in distributed MoE training through strategies such as an auxiliary load-balancing loss.
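
As a quick illustration, the paper's auxiliary load-balancing loss takes the form alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the average router probability assigned to expert i. Below is a minimal NumPy sketch of that formula; the function and variable names (load_balancing_loss, router_probs, alpha) are illustrative, not taken from the authors' code.

```python
import numpy as np

def load_balancing_loss(router_probs: np.ndarray, alpha: float = 0.01) -> float:
    """Sketch of the Switch Transformer auxiliary load-balancing loss.

    router_probs: [num_tokens, num_experts] softmax outputs of the router.
    Returns alpha * N * sum_i(f_i * P_i), where
      f_i = fraction of tokens whose top-1 expert is i,
      P_i = mean router probability assigned to expert i.
    The value approaches alpha when routing is perfectly uniform.
    """
    num_tokens, num_experts = router_probs.shape
    # f_i: fraction of tokens dispatched to each expert under top-1 routing.
    expert_indices = router_probs.argmax(axis=-1)
    f = np.bincount(expert_indices, minlength=num_experts) / num_tokens
    # P_i: average router probability mass assigned to each expert.
    p = router_probs.mean(axis=0)
    return alpha * num_experts * float(np.dot(f, p))

# Usage: near-uniform routing yields a loss close to alpha.
probs = np.random.dirichlet(np.ones(8), size=1024)  # [tokens, experts]
print(load_balancing_loss(probs))
```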