Before a Transformer model can process text, the raw strings must be converted into a numerical format it understands. This conversion process is known as tokenization. While simpler methods exist, such as splitting text by spaces to get words, they quickly run into problems with large vocabularies and words not seen during training (out-of-vocabulary or OOV words). Transformers typically leverage more sophisticated techniques called subword tokenization algorithms.
The fundamental idea behind subword tokenization is to break down words into smaller, frequently occurring units. This approach offers a balance: it keeps the vocabulary size manageable while significantly reducing the chance of encountering unknown tokens. Instead of representing "transformer" and "transformers" as two entirely separate tokens, a subword tokenizer might represent them as ["transform", "er"] and ["transform", "ers"]. The root "transform" is learned, and common affixes like "er" and "ers" are learned separately. This allows the model to potentially understand novel combinations or variations of words based on the subword units it has learned.
Let's explore the most common subword tokenization algorithms used with Transformer models:
Originally a data compression algorithm, Byte Pair Encoding (BPE) was adapted for text tokenization. It works iteratively: start with a vocabulary of individual characters, count how often each adjacent pair of symbols occurs in the training corpus, merge the most frequent pair into a single new symbol, and repeat until the vocabulary reaches a target size.
Consider a simplified example with a tiny corpus and a few merges:

Corpus: {"low low low", "lowest lowest", "newer newer", "wider wider"}
Initial vocabulary (individual characters): {'l', 'o', 'w', ' ', 's', 't', 'n', 'e', 'r', 'i', 'd'}

1. Count pairs: pairs such as ('l', 'o'), ('o', 'w'), ('w', ' '), and ('e', 'r') appear frequently. Let's say ('e', 'r') is the most frequent.
2. Merge ('e', 'r') into the new symbol 'er'. The symbols in use become {'l', 'o', 'w', ' ', 's', 't', 'n', 'e', 'er', 'i', 'd'}, and the corpus is conceptually segmented as {"low low low", "lowest lowest", "new'er' new'er'", "wid'er' wid'er'"}.
3. Count pairs again, now including pairs such as ('w', 'er') or (' ', 'er'). Maybe ('l', 'o') is now the most frequent.
4. Merge ('l', 'o') into 'lo'. The symbols in use become {'lo', 'w', ' ', 's', 't', 'n', 'e', 'er', 'i', 'd'}, and the corpus becomes {"'lo'w 'lo'w 'lo'w", "'lo'west 'lo'west", "new'er' new'er'", "wid'er' wid'er'"}.
5. Count pairs again: ('lo', 'w') now becomes frequent.
6. Merge ('lo', 'w') into 'low'. The symbols in use become {'low', ' ', 's', 't', 'n', 'er', 'i', 'd', 'e', 'w'} (note that single characters like 'e' and 'w' are still needed, for example in "lowest" and "newer"). Corpus: {"'low' 'low' 'low'", "'low'est 'low'est", "new'er' new'er'", "wid'er' wid'er'"}.

This process continues, building up common subwords like "lowest", "newer", and "wider", or potentially stopping earlier depending on the desired vocabulary size. (In an actual BPE implementation the vocabulary only grows: base characters are kept and each merge adds a new symbol; the sets above show only the symbols currently used to segment the corpus.)
A simplified visualization of BPE merging steps for the word "newer". Initial characters are combined based on frequency to form subwords like "er", potentially "new" (assuming 'n','e' merged), and finally "newer".
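To make the count-and-merge loop concrete, here is a minimal, illustrative Python sketch of the procedure described above. It is a simplification that operates on a small hand-built word-frequency table and ignores efficiency; it is not the implementation used by any particular library, and the merges it prints may differ from the walk-through above depending on exact frequencies and tie-breaking.
from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    # Replace every occurrence of the given adjacent pair with a single merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with its frequency
corpus = {
    tuple("low"): 3, tuple("lowest"): 2,
    tuple("newer"): 2, tuple("wider"): 2,
}

num_merges = 3  # a small number of merges for illustration
for _ in range(num_merges):
    pair_counts = get_pair_counts(corpus)
    best_pair = max(pair_counts, key=pair_counts.get)
    corpus = merge_pair(corpus, best_pair)
    print("Merged:", best_pair)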
WordPiece is conceptually similar to BPE but uses a different criterion for merging. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data, given the vocabulary. It essentially asks: "Which merge makes the training data most probable under a simple language model defined by the vocabulary?". WordPiece is notably used by BERT and related models. It often results in subwords that align well with linguistic morphemes, although that's not its explicit goal. A common convention for WordPiece is to prefix subwords that continue a word with ## (e.g., "transformers" might become ["transform", "##er", "##s"]).
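For illustration, the Hugging Face tokenizers library also ships a WordPiece model and trainer, used much like the BPE example later in this section. This is a minimal sketch with a placeholder corpus path; the printed tokens in the comment are only a plausible example, since the actual subwords depend on the trained vocabulary.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a WordPiece tokenizer; "##" marks subwords that continue a word
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=10000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["path/to/your/corpus.txt"], trainer)  # placeholder path

output = tokenizer.encode("transformers")
print(output.tokens)  # e.g. ['transform', '##er', '##s'], depending on the learned vocabulary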
SentencePiece treats the input text as a raw sequence, including whitespace. Unlike BPE and WordPiece, which often require pre-tokenization (like splitting by spaces), SentencePiece operates directly on the raw byte stream or Unicode characters. It encodes whitespace explicitly, often using a special character like '▁' (U+2581) to represent a space within a token. This makes it particularly effective for languages where word boundaries are not clearly defined by spaces, and it allows a single, consistent tokenization/detokenization process across different languages without language-specific logic.
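As a point of comparison, here is a minimal, illustrative sketch using the standalone sentencepiece package (a separate library from Hugging Face tokenizers). The corpus path, model prefix, and vocabulary size are placeholder values, and the tokens shown in the comment are only an example of what the output might look like.
import sentencepiece as spm

# Train a SentencePiece model on a raw text file (placeholder path and settings)
spm.SentencePieceTrainer.train(
    input="path/to/your/corpus.txt",
    model_prefix="spm_example",
    vocab_size=8000,
)

# Load the trained model and tokenize raw text; '▁' marks pieces that begin a word
sp = spm.SentencePieceProcessor(model_file="spm_example.model")
pieces = sp.encode("This is example text.", out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁example', '▁text', '.'], depending on the learned vocabulary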
Beyond subwords derived from the data, Transformer tokenizers include several special tokens essential for model operation:
[PAD] (Padding Token): Used to make sequences in a batch the same length. The model learns to ignore these via the attention mask.
[UNK] (Unknown Token): Represents any subword not present in the tokenizer's vocabulary. Ideally, subword tokenization minimizes the occurrence of [UNK].
[CLS] (Classification Token): Often added to the beginning of an input sequence. The final hidden state corresponding to this token is frequently used as the aggregate sequence representation for classification tasks.
[SEP] (Separator Token): Used to separate distinct segments of text within a single input sequence (e.g., separating question and context in question answering, or two sentences for next-sentence prediction).
[MASK] (Mask Token): Used specifically during masked language model pre-training (like in BERT), where input tokens are randomly replaced with [MASK], and the model learns to predict the original tokens.
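To see several of these tokens in an actual encoding, here is a short illustrative snippet. It assumes the Hugging Face transformers package is installed and downloads the pre-trained bert-base-uncased tokenizer on first use; the example sentences are arbitrary.
from transformers import AutoTokenizer

# Load a pre-trained WordPiece-based tokenizer (BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A question/context pair: [CLS] is prepended, [SEP] separates and ends the segments
pair = tokenizer("What is tokenization?", "It converts text into tokens.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))

# Batching two sentences of different lengths: the shorter one is filled with [PAD],
# and the attention mask marks those padding positions with 0
batch = tokenizer(["A short sentence.", "A somewhat longer example sentence."], padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
print(batch["attention_mask"][0])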
Implementing these tokenization algorithms from scratch can be complex. Fortunately, excellent libraries handle this efficiently. The Hugging Face tokenizers library provides highly optimized implementations of BPE, WordPiece, and others, allowing you to easily load pre-trained tokenizers or train your own on a specific corpus.
# Example using Hugging Face tokenizers (conceptual)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Initialize a tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# 2. Customize pre-tokenization (e.g., split by whitespace first)
tokenizer.pre_tokenizer = Whitespace()
# 3. Define a trainer (vocab size, special tokens)
trainer = BpeTrainer(vocab_size=10000, special_tokens=[
"[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"
])
# 4. Train on your text files
files = ["path/to/your/corpus.txt", ...] # List of text files
tokenizer.train(files, trainer)
# 5. Save the tokenizer
tokenizer.save("my-bpe-tokenizer.json")
# 6. Load and use
tokenizer = Tokenizer.from_file("my-bpe-tokenizer.json")
output = tokenizer.encode("This is example text.")
print(f"Tokens: {output.tokens}")
# Example output (tokens depend on your training corpus), e.g.:
# Tokens: ['This', 'is', 'example', 'text', '.']
# (Note: with the Whitespace pre-tokenizer used here, no explicit space marker appears.
#  Byte-level BPE tokenizers such as GPT-2's mark a preceding space with 'Ġ',
#  while SentencePiece uses '▁'.)
print(f"IDs: {output.ids}")
# Example output (IDs depend on your vocabulary), e.g.: IDs: [713, 164, 1794, 1036, 5]
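The exact tokens and IDs above are illustrative; they depend on the trained vocabulary. The tokenizers library can also map IDs back to text, which is useful for inspecting model inputs and outputs. A minimal usage sketch:
# Map token IDs back to a string; special tokens such as [PAD] can be skipped
decoded = tokenizer.decode(output.ids, skip_special_tokens=True)
print(decoded)
# Example output: "This is example text ." (exact spacing depends on the decoder configured for the tokenizer)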
Understanding tokenization is the first practical step in preparing your data for a Transformer. The choice of algorithm and vocabulary size impacts how the model "sees" the text, influencing its performance and ability to generalize. Once text is converted into these sequences of integer IDs (like output.ids above), you can proceed to create batches suitable for training, which we'll discuss next.