Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03762 - Describes the Transformer architecture, which is fundamental to modern large language models, explaining the self-attention mechanism that underlies their sequence-modeling and pattern-matching capabilities.
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2005.14165 - Presents GPT-3, demonstrating how increased scale in parameters and training data enables large language models to perform a wide array of tasks with minimal task-specific data.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2019. Journal of Machine Learning Research (JMLR). DOI: 10.48550/arXiv.1910.10683 - Details the Text-to-Text Transfer Transformer (T5) model, offering a comprehensive study of transfer learning techniques and showing how various NLP tasks can be framed as text-to-text problems.
On the Opportunities and Risks of Foundation Models, Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Percy Liang, et al., 2021. arXiv (Stanford Institute for Human-Centered Artificial Intelligence (HAI)). DOI: 10.48550/arXiv.2108.07258 - Introduces the concept of foundation models, of which large language models are a prominent type, discussing their shared capabilities and implications across various applications.