GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, Claire Cui, ICML 2022. DOI: 10.48550/arXiv.2112.06905 - Presents a mixture-of-experts (MoE) architecture focused on efficient scaling, addressing how compute allocation and per-expert capacity limits affect token routing and can cause tokens to be dropped.
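
The capacity-limit mechanism the annotation refers to can be illustrated with a minimal sketch of GShard-style top-2 routing; the function name `route_top2_with_capacity`, the `capacity_factor` default, and the drop-on-overflow policy here are illustrative assumptions, not GLaM's actual implementation.

```python
import numpy as np

def route_top2_with_capacity(gate_logits, capacity_factor=1.25):
    """Sketch of capacity-limited top-2 MoE routing (illustrative, not GLaM's code).

    Each token tries its two highest-scoring experts; an expert that is
    already full rejects the token, and a token rejected by both experts
    is dropped (it bypasses the MoE layer via the residual connection).
    """
    num_tokens, num_experts = gate_logits.shape
    # Each expert holds roughly its fair share of tokens, scaled by the factor.
    capacity = int(np.ceil(capacity_factor * num_tokens / num_experts))

    # Softmax over experts to get gate weights per token.
    z = gate_logits - gate_logits.max(axis=-1, keepdims=True)
    gates = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    load = np.zeros(num_experts, dtype=int)
    routed, dropped = [], []
    for t in range(num_tokens):
        hits = 0
        for e in np.argsort(-gates[t])[:2]:  # top-2 experts for this token
            if load[e] < capacity:
                load[e] += 1
                routed.append((t, int(e), float(gates[t, e])))
                hits += 1
        if hits == 0:
            dropped.append(t)  # both chosen experts were at capacity

    return routed, dropped, capacity

# Example: 16 tokens, 4 experts; skewed logits overload expert 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))
logits[:, 0] += 2.0
routed, dropped, cap = route_top2_with_capacity(logits)
print(f"capacity/expert={cap}, routed slots={len(routed)}, dropped tokens={dropped}")
```

With a skewed gate distribution, the favored expert fills to capacity and later tokens spill to their second choice or are dropped outright, which is the resource-allocation trade-off the paper's annotation highlights.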