GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui, 2022, ICML 2022, DOI: 10.48550/arXiv.2112.06905 - Proposes an MoE architecture focused on efficiency and scaling, and discusses how resource allocation and capacity limits affect token processing and potential token dropping.