Scaling Laws for Neural Language Models. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - A landmark paper that examines how the performance of dense neural language models scales with model size, dataset size, and compute budget, providing context for understanding the parameter-efficiency advantages of MoE architectures.