Recap: Pre-trained Language Models and Transformers
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, including the self-attention mechanism (a minimal sketch of the attention computation follows this list), which enabled large-scale pre-training of language models.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). DOI: 10.48550/arXiv.1810.04805 - Presents the Bidirectional Encoder Representations from Transformers (BERT) model and introduces the Masked Language Modeling objective for pre-training (a simplified masking example follows this list).
CS224N: Natural Language Processing with Deep Learning, Diyi Yang, Tatsunori Hashimoto, 2025 (Stanford University) - An academic course offering extensive lecture materials and assignments on deep learning methods for NLP, including detailed coverage of Transformers and large language models.
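For a concrete picture of the self-attention mechanism referenced in the Vaswani et al. entry, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer. The single-head setup, array shapes, and function name are illustrative assumptions, not the paper's reference implementation, which uses multiple heads and learned projection matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking).

    Q, K: (seq_len, d_k) query and key matrices; V: (seq_len, d_v) values.
    Returns a weighted sum of values for each query position.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over key positions
    return weights @ V                                    # attention-weighted values

# Toy usage: 4 tokens with 8-dimensional representations.
# In self-attention, queries, keys, and values all come from the same sequence.
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In practice, Q, K, and V are separate learned linear projections of the same token representations, and several attention heads run in parallel before their outputs are concatenated.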
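Likewise, the Masked Language Modeling objective from the BERT entry can be pictured as a corruption step applied to the input tokens before prediction. The sketch below is deliberately simplified: the published recipe selects about 15% of tokens and replaces only 80% of those with the mask symbol (10% with random tokens, 10% left unchanged). The function name and mask string here are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Simplified BERT-style masking: hide some tokens for the model to predict.

    Returns the corrupted sequence and, per position, the original token if it
    was masked (the prediction target) or None otherwise.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)   # hide the token from the model
            targets.append(tok)            # the model must recover it
        else:
            corrupted.append(tok)
            targets.append(None)           # no loss at unmasked positions
    return corrupted, targets

random.seed(1)  # fixed seed so the demo masks at least one token
print(mask_tokens("the cat sat on the mat".split()))
```

Training then minimizes cross-entropy loss only at the masked positions, which is what lets BERT learn bidirectional representations from unlabeled text.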