Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017), Curran Associates, Inc. DOI: 10.5555/3295222.3295349 - The foundational paper introducing the Transformer architecture, which underpins most modern LLMs and enables their sophisticated contextual understanding and next-token prediction capabilities.
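The core operation the paper introduces is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of that formula (single head, toy dimensions chosen here for illustration) might look like:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention from Vaswani et al. (2017):
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys, per query
    return weights @ V  # each output row is a weighted sum of value rows

# Toy self-attention example: 3 tokens, model dimension 4 (random stand-in values)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```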
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A comprehensive textbook covering language models, sequence prediction, and the statistical foundations of natural language processing, all central to understanding LLM mechanisms. Chapter 3 ("N-gram Language Models") and the chapters on deep learning for NLP are particularly useful.
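The n-gram models of Chapter 3 estimate next-word probabilities from counts of short word sequences. A minimal bigram sketch (the tiny corpus below is invented for illustration, not taken from the textbook):

```python
from collections import Counter

# Toy two-sentence corpus, pre-tokenized on whitespace
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Count bigram occurrences and how often each word appears as a context
bigrams = Counter()
contexts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times as a context, "the cat" once
```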
The Hugging Face Course: How do Large Language Models work?, Hugging Face, 2023 - Provides an accessible yet detailed explanation of how large language models generate text through sequential token prediction, probability distributions, and decoding strategies like greedy decoding.
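As a sketch of the greedy decoding strategy the course describes: at each step the model's probability distribution over the vocabulary is collapsed to a single choice by taking the most likely next token. The lookup table standing in for a real model below is invented for illustration:

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens=5, eos="<eos>"):
    """Greedy decoding: at each step, append the single most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # distribution over the vocabulary
        token = max(probs, key=probs.get)  # argmax = the greedy choice
        if token == eos:
            break
        tokens.append(token)
    return tokens

# Stand-in "model": a fixed context-to-distribution table instead of a real LLM
table = {
    ("the",): {"cat": 0.6, "dog": 0.3, "<eos>": 0.1},
    ("the", "cat"): {"sat": 0.7, "ran": 0.2, "<eos>": 0.1},
    ("the", "cat", "sat"): {"<eos>": 0.9, "down": 0.1},
}
print(greedy_decode(lambda t: table[tuple(t)], ["the"]))  # ['the', 'cat', 'sat']
```

Greedy decoding is the simplest strategy; the same loop generalizes to sampling or beam search by replacing the argmax step.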