Before a model like the Transformer can process text, the raw sequence of characters must be converted into a sequence of numerical IDs. This conversion process is known as tokenization. While simple methods like splitting text by spaces work for smaller tasks, they struggle with the vast vocabularies and morphological variations found in the massive datasets used for LLMs. Handling unknown words (Out-Of-Vocabulary or OOV) and managing potentially millions of unique words requires more sophisticated approaches.
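To make the OOV problem concrete, here is a minimal sketch (not from this chapter's code, purely illustrative) of a whitespace tokenizer with a fixed word-level vocabulary. The toy vocabulary and the <unk> fallback are assumptions for this example; any word outside the vocabulary collapses to a single unknown ID and its meaning is lost.

```python
# Toy word-level vocabulary; ID 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "model": 2, "learns": 3, "quickly": 4}

def whitespace_encode(text):
    # Split on spaces and map each word to its ID, falling back to <unk>.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(whitespace_encode("the model learns quickly"))    # [1, 2, 3, 4]
print(whitespace_encode("the model overfits quickly"))  # [1, 2, 0, 4] -- "overfits" is OOV
```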
This chapter focuses on subword tokenization algorithms designed to address these challenges. You will learn about Byte Pair Encoding (BPE) and WordPiece, techniques that build a vocabulary from frequent subword units rather than whole words. We will also cover the SentencePiece framework, the role and management of special tokens (such as [CLS] and [SEP]), and the practical considerations for choosing a vocabulary size (|V|), balancing model expressiveness against computational efficiency. By the end, you'll understand how to prepare text data effectively for large models.
5.1 The Need for Subword Tokenization
5.2 Byte Pair Encoding (BPE) Algorithm
5.3 WordPiece Tokenization
5.4 SentencePiece Implementation
5.5 Handling Special Tokens
5.6 Vocabulary Size Selection Trade-offs