Preprocessing data: Tokenization, Hugging Face, 2023 (Hugging Face) - Official guide to tokenization within the Hugging Face Transformers library, detailing concepts like special tokens, padding, and truncation.
Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics), DOI: 10.18653/v1/P16-1162 - Adapts Byte Pair Encoding (BPE), originally a data compression algorithm, to subword tokenization, an approach foundational to many LLMs.
Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, 2019 (OpenAI) - Introduces GPT-2 and outlines the model's architecture and the byte-level BPE tokenization it uses.