Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020. Advances in Neural Information Processing Systems, Vol. 33 (NeurIPS 2020). DOI: 10.48550/arXiv.2005.14165 - Describes the architecture and training of GPT-3, detailing the scale and diversity of its training dataset and establishing the data demands of large language models.
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - The foundational study of scaling laws for language models, quantitatively describing how performance improves as model size, dataset size, and compute grow, and underscoring the importance of data (the power-law form is sketched after this list).
Training Compute-Optimal Large Language Models, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre, 2022. Advances in Neural Information Processing Systems, Vol. 35 (NeurIPS 2022). DOI: 10.48550/arXiv.2203.15556 - Introduces the "Chinchilla" scaling laws, refining the earlier results by showing that compute-optimal training requires considerably more data for a given model size than previously assumed (a worked allocation sketch follows this list).
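As a rough companion to the Kaplan et al. entry, the scaling laws take a power-law form in model size N (non-embedding parameters), dataset size D (tokens), and compute C. The schematic below paraphrases the paper's equations; the exponents are the approximate fitted values reported there and should be read as indicative rather than exact.

```latex
% Schematic form of the scaling laws in Kaplan et al. (2020).
% N = non-embedding parameters, D = tokens, C_min = compute at the optimal model size.
% Exponents are approximate fitted values from the paper.
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\min}) \approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050
```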
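For the Hoffmann et al. entry, here is a minimal sketch of the compute-optimal allocation it implies, assuming the common C ≈ 6·N·D approximation for training FLOPs and the paper's rough finding of about 20 training tokens per parameter; the function name and the example budget are illustrative, not taken from the paper.

```python
# Minimal sketch of a Chinchilla-style compute-optimal split of a FLOPs budget.
# Assumptions (labeled, not from the reference itself):
#   - training compute C ≈ 6 * N * D  (N = parameters, D = training tokens)
#   - optimal data-to-parameter ratio on the order of ~20 tokens per parameter

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into parameters N and tokens D
    under C = 6*N*D and D = tokens_per_param * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Illustrative budget of 1e24 FLOPs: yields roughly 9e10 parameters
    # trained on roughly 1.8e12 tokens under these assumptions.
    n, d = compute_optimal_allocation(1e24)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With a budget near Chinchilla's reported compute (about 6e23 FLOPs), this heuristic lands close to the paper's 70B-parameter, 1.4T-token configuration, which is the point of the annotation above: for a fixed budget, smaller models trained on more data are preferred over the larger, data-starved models suggested by the earlier laws.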