Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, which forms the basis for modern large language models and their learning mechanisms.
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020. Advances in Neural Information Processing Systems (NeurIPS 2020). DOI: 10.48550/arXiv.2005.14165 - Details the training and capabilities of GPT-3, demonstrating how scaling up next-token prediction on vast datasets leads to powerful language understanding and generation.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). DOI: 10.48550/arXiv.1810.04805 - Describes pre-training large Transformer models on massive text corpora using self-supervised tasks like masked language modeling, which is foundational for learning broad language representations.
Speech and Language Processing, Daniel Jurafsky and James H. Martin, 2025 (Stanford University) - Offers a comprehensive academic introduction to natural language processing, including the foundational theories of language models and neural network training.
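For a concrete sense of the mechanism introduced in the first reference, the following is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as defined in the Transformer paper. The function name, array shapes, and toy inputs are illustrative assumptions, not details taken from the paper.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                              # weighted sum of value vectors

# Toy usage (hypothetical shapes): 3 query positions, 4 key/value positions, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)

In the full architecture this operation is applied per attention head with learned projections of Q, K, and V; the sketch omits those projections and multi-head concatenation for brevity.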