Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Vol. 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03762 - The foundational paper on the Transformer architecture, which underlies modern large language models; its attention mechanism is a major source of the computational and memory challenges discussed in the text (see the sketch after this list).
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, Mohammad Shoeybi, Mostafa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019. arXiv preprint arXiv:1909.08053. DOI: 10.48550/arXiv.1909.08053 - Introduces a framework and techniques for efficiently training very large Transformer-based language models with model parallelism, directly addressing the memory and compute challenges discussed in the text.
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - Establishes empirical scaling laws for language models, showing how model size, dataset size, and compute budget jointly determine performance and why enormous computational resources are required (see the power-law form after this list).
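As context for the first entry, a minimal sketch of the scaled dot-product attention defined in Vaswani et al. (2017); the remark on quadratic cost is a summary of why this mechanism drives the memory and compute challenges mentioned above, not a result quoted from the paper itself:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

For a sequence of length \(n\) with key dimension \(d_k\), the product \(Q K^{\top}\) is an \(n \times n\) matrix, so time and memory grow quadratically with sequence length.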
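As a pointer for the third entry, the approximate power-law forms reported by Kaplan et al. (2020) for loss as a function of parameter count \(N\), dataset size \(D\), and optimal compute \(C_{\min}\); the exponents below are the paper's reported fits, quoted approximately:

\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}},
\]

with \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), and \(\alpha_C^{\min} \approx 0.050\), which is why performance gains demand steep increases in parameters, data, and compute.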