Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. 2020. Advances in Neural Information Processing Systems, Vol. 33 (NeurIPS 2020). DOI: 10.48550/arXiv.2005.14165 - Describes the architecture and training of GPT-3, detailing the unprecedented scale and diverse composition of its training dataset and illustrating the volume of data that large language models require.
Scaling Laws for Neural Language Models. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - Presents foundational work on scaling laws for language models, showing empirically that test loss falls as a power law in model size, dataset size, and training compute, and thereby underscoring the central role of data.
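As a rough illustration of the relations reported in Kaplan et al. (constants and exponents quoted approximately from the paper, not independently verified here), test loss follows power laws in non-embedding parameter count N, dataset size D (in tokens), and compute C when the other factors are not the bottleneck:

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}, \]

with fitted exponents on the order of α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.05.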
Training Compute-Optimal Large Language Models. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Advances in Neural Information Processing Systems, Vol. 35 (NeurIPS 2022). DOI: 10.48550/arXiv.2203.15556 - Introduces the 'Chinchilla' scaling law, which refines earlier findings by showing that compute-optimal training scales model size and training tokens roughly in equal proportion, implying far more data for a given model size than previously assumed.
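A minimal sketch of the paper's parametric loss model and the rule of thumb it implies (constants quoted approximately from the paper): loss is modeled as a function of parameter count N and training tokens D,

\[ L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \]

with fitted exponents of roughly α ≈ 0.34 and β ≈ 0.28. Minimizing this under a fixed compute budget C ≈ 6ND yields optimal N and D that both grow roughly as C^{1/2}, i.e. about 20 training tokens per parameter; the 70B-parameter Chinchilla model was accordingly trained on about 1.4 trillion tokens.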