Scaling laws for neural language models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - This foundational study proposes empirical scaling laws that relate model size, dataset size, and training compute to model performance.
Training compute-optimal large language models, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre, 2022. arXiv preprint arXiv:2203.15556. DOI: 10.48550/arXiv.2203.15556 - This paper refines the scaling laws, showing how to allocate a compute budget optimally between model size and data size when training large language models.
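Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. arXiv. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, which underlies most large language models, and explains the computational and memory characteristics of the attention mechanism.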