Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, Noam Shazeer, 2022, Journal of Machine Learning Research, Vol. 23 (Microtome Publishing) - Presents Switch Transformers, a simplified Mixture-of-Experts (MoE) architecture (k=1 routing) that scales to extremely large models and addresses training stability and load balancing.
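
A minimal sketch of the k=1 ("switch") routing idea mentioned above, together with the auxiliary load-balancing loss the paper uses (loss = alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i). Function and variable names here are illustrative, not from the paper's codebase, and capacity limits and expert dispatch are omitted.

```python
import numpy as np

def switch_route(router_logits, alpha=0.01):
    """Top-1 (k=1) routing over experts.

    router_logits: [num_tokens, num_experts] raw router scores (illustrative shapes).
    Returns the chosen expert per token, the gate value used to scale the
    expert output, and the auxiliary load-balancing loss.
    """
    num_tokens, num_experts = router_logits.shape

    # Softmax over experts to get per-token router probabilities.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # k=1 routing: each token is sent to its single highest-probability expert.
    expert_index = probs.argmax(axis=-1)                  # [num_tokens]
    gate = probs[np.arange(num_tokens), expert_index]     # scale factor per token

    # Auxiliary load-balancing loss: encourages a uniform split of tokens
    # across experts. f_i = fraction of tokens routed to expert i,
    # P_i = mean router probability assigned to expert i.
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    P = probs.mean(axis=0)
    aux_loss = alpha * num_experts * np.sum(f * P)

    return expert_index, gate, aux_loss

# Example: route 8 tokens among 4 experts.
idx, gate, loss = switch_route(np.random.randn(8, 4))
```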