As introduced earlier, standard word-level tokenization runs into significant problems when dealing with the massive and diverse text corpora used to train large language models. Primarily, it struggles with out-of-vocabulary (OOV) words and the sheer size of the potential vocabulary. Byte Pair Encoding (BPE) offers an effective data-driven solution to these challenges. Originally developed as a data compression algorithm, BPE has been successfully adapted for text tokenization, creating a vocabulary of subword units.
The core idea behind BPE is elegantly simple: it starts with a vocabulary consisting of all individual characters present in the training corpus and iteratively merges the most frequently occurring adjacent pair of symbols to form new, longer subword symbols. This process continues for a predetermined number of merges, effectively controlling the final vocabulary size.
Let's walk through how the BPE algorithm learns its vocabulary and merge rules from a corpus.
Initialization: Build the starting vocabulary from every individual character that appears in the training corpus, and represent each word as a sequence of characters with a special end-of-word marker (such as </w>) appended to distinguish word boundaries. For example, "lower" becomes l, o, w, e, r, </w>.
Iteration: Repeat the following steps until the desired vocabulary size V_target is reached or a set number of merge operations have been performed:
1. Count every pair of adjacent symbols across the corpus, weighting each occurrence by the frequency of the word it appears in.
2. Find the most frequent pair.
3. Merge that pair into a single new symbol and add the symbol to the vocabulary.
4. Replace all occurrences of the pair with the new symbol and record the merge as an ordered rule.
Consider a tiny corpus with word counts: {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}.
Step 0: Initialization
Split each word into characters and append the end-of-word marker </w>:
- low: l o w </w> (occurs 5 times)
- lower: l o w e r </w> (occurs 2 times)
- newest: n e w e s t </w> (occurs 6 times)
- widest: w i d e s t </w> (occurs 3 times)
The initial vocabulary is the set of unique symbols: {l, o, w, </w>, e, r, n, s, t, i, d}
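This initialization is easy to reproduce in a few lines of Python; the variable names below (word_counts, splits, vocab) are just for illustration:

word_counts = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}

# Each word becomes a list of characters plus the end-of-word marker
splits = {word: list(word) + ['</w>'] for word in word_counts}
print(splits['newest'])  # ['n', 'e', 'w', 'e', 's', 't', '</w>']

# The initial vocabulary is the set of unique symbols across all words
vocab = {sym for symbols in splits.values() for sym in symbols}
print(sorted(vocab))     # ['</w>', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']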
Step 1: Count Pairs and Merge
Count every adjacent symbol pair, weighted by word frequency:
- (l, o): 5 + 2 = 7
- (o, w): 5 + 2 = 7
- (w, </w>): 5
- (w, e): 2 + 6 = 8
- (e, r): 2
- (r, </w>): 2
- (n, e): 6
- (e, w): 6
- (e, s): 6 + 3 = 9 <- Most frequent
- (s, t): 6 + 3 = 9 <- Equally frequent (let's pick (e, s))
- (t, </w>): 6 + 3 = 9 <- Equally frequent
- (w, i): 3
- (i, d): 3
- (d, e): 3
Merge (e, s) into a new symbol es. The vocabulary size increases, and the corpus representation becomes:
- l o w </w> (5)
- l o w e r </w> (2)
- n e w es t </w> (6)
- w i d es t </w> (3)
Learned merge: e + s -> es
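The pair counts above can be reproduced with collections.Counter. This sketch assumes the word_counts and splits variables from the previous snippet:

from collections import Counter

# Count adjacent symbol pairs, weighting each by its word's frequency
pair_counts = Counter()
for word, symbols in splits.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += word_counts[word]

print(pair_counts[('e', 's')])  # 9
print(pair_counts[('w', 'e')])  # 8
best_pair = max(pair_counts, key=pair_counts.get)  # ('e', 's'); ties resolved by first occurrence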
Step 2: Count Pairs and Merge
Recount the pairs on the updated corpus:
- (l, o): 7
- (o, w): 7
- (w, </w>): 5
- (w, e): 2
- (e, r): 2
- (r, </w>): 2
- (n, e): 6
- (e, w): 6
- (w, es): 6
- (es, t): 6 + 3 = 9 <- Most frequent (tied with (t, </w>); let's pick (es, t))
- (t, </w>): 6 + 3 = 9
- (w, i): 3
- (i, d): 3
- (d, es): 3
Merge (es, t) into a new symbol est. The corpus representation becomes:
- l o w </w> (5)
- l o w e r </w> (2)
- n e w est </w> (6)
- w i d est </w> (3)
Learned merge: es + t -> est
Step 3: Count Pairs and Merge
(est, </w>)
: 6 + 3 = 9 <- Most frequent (or equally with (t, </w>)
if est
wasn't picked in step 2)(est, </w>)
into est</w>
.l o w </w>
(5)l o w e r </w>
(2)n e w est</w>
(6)w i d est</w>
(3)est + </w> -> est</w>
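The merge-and-update step applied by hand in Steps 1 through 3 can be written as a small helper. The function name apply_merge and the repeated setup are only for illustration; replaying the three learned merges reproduces the table above:

def apply_merge(symbols, pair, new_token):
    # Replace every adjacent occurrence of `pair` with `new_token`
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Same toy setup as before
splits = {w: list(w) + ['</w>'] for w in ['low', 'lower', 'newest', 'widest']}

# Replay the three merges learned in Steps 1-3
for pair in [('e', 's'), ('es', 't'), ('est', '</w>')]:
    splits = {w: apply_merge(s, pair, ''.join(pair)) for w, s in splits.items()}

print(splits['widest'])  # ['w', 'i', 'd', 'est</w>']
print(splits['lower'])   # ['l', 'o', 'w', 'e', 'r', '</w>']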
This process continues. If we performed more merges, pairs like (l, o), (o, w), and (w, e) might also get merged, potentially forming low or we. The final vocabulary would contain the single characters plus common multi-character subwords like es, est, and est</w>, based on their frequency in the training data.
Once the BPE vocabulary and the ordered list of merge operations are learned from the training corpus, tokenizing new text involves the following steps:
1. Split the text into words (pre-tokenization) and break each word into its characters, appending the end-of-word marker </w>.
2. Apply the learned merge operations to each word in the exact order they were learned, repeating until no more merges apply.

For example, if we learned the merges e + s -> es, then es + t -> est, then est + </w> -> est</w>, tokenizing "tests" would proceed as:
- Initial split: t, e, s, t, s, </w>
- Apply e + s -> es: t, es, t, s, </w>
- Apply es + t -> est: the adjacent pair (es, t) is present, giving t, est, s, </w>
- Apply est + </w> -> est</w>: no adjacent (est, </w>) pair exists, so nothing changes.
The final tokens are t, est, s, </w>. If a later merge such as s + </w> -> s</w> had also been learned, it would apply at this point and give t, est, s</w>. Because merges are applied strictly in the order they were learned, the order of merge operations matters.
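A minimal sketch of this inference step in Python, assuming a merges dict that stores the learned rules in order (the function name bpe_encode is just for illustration):

merges = {('e', 's'): 'es', ('es', 't'): 'est', ('est', '</w>'): 'est</w>'}

def bpe_encode(word, merges):
    # Start from characters plus the end-of-word marker
    symbols = list(word) + ['</w>']
    # Apply each learned merge in the order it was learned
    for pair, new_token in merges.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_token)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_encode('tests', merges))  # ['t', 'est', 's', '</w>']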
A significant advantage of BPE is its inherent ability to handle words not seen during training (OOV words). Since the initial vocabulary includes all individual characters (and byte-level variants, such as GPT-2's, start from all 256 possible byte values), any word can be broken down into a sequence of known symbols if necessary. If parts of the word correspond to learned subwords, those are used; otherwise, the tokenizer falls back to individual characters. For instance, if "huggingface" wasn't in the training data but "hugg", "ing", and "face" were learned subwords (or could be formed by merges), it might be tokenized as hugg, ing, face, </w>. If not, it might become h, u, g, g, i, n, g, f, a, c, e, </w>. There are no true "unknown" tokens in the sense of needing a dedicated [UNK] token, although one might still be included for other purposes.
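The same fallback behavior shows up with the toy merges learned earlier. Continuing with the bpe_encode sketch and merges dict from the previous snippet:

# "lowest" never appeared in the toy corpus, but its pieces did
print(bpe_encode('lowest', merges))  # ['l', 'o', 'w', 'est</w>']

# "wider" matches no learned merges, so it stays at the character level
print(bpe_encode('wider', merges))   # ['w', 'i', 'd', 'e', 'r', '</w>']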
Libraries like Hugging Face tokenizers provide optimized implementations, but the training loop itself is short. Here is a simplified, runnable version of the algorithm described above (it reuses the apply_merge helper defined earlier):
# BPE training loop
from collections import Counter

corpus = ["low low low low low", "lower lower",
          "newest newest newest newest newest newest",
          "widest widest widest"]  # toy corpus matching the walkthrough above
word_counts = Counter(w for line in corpus for w in line.split())

# Initialize: characters plus end-of-word marker, e.g. {'low': ['l', 'o', 'w', '</w>'], ...}
splits = {w: list(w) + ['</w>'] for w in word_counts}
vocab = {sym for syms in splits.values() for sym in syms}

target_vocab_size = 20
merges = {}  # learned merges, stored in priority order

for _ in range(target_vocab_size - len(vocab)):
    # Count adjacent symbol pairs, weighted by word frequency
    pair_counts = Counter()
    for w, syms in splits.items():
        for pair in zip(syms, syms[1:]):
            pair_counts[pair] += word_counts[w]
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)
    new_token = ''.join(best)
    vocab.add(new_token)
    merges[best] = new_token
    # Apply the new merge to every word in the corpus
    splits = {w: apply_merge(syms, best, new_token) for w, syms in splits.items()}

# Save vocab and merges
print(list(merges.items())[:3])
# [(('e', 's'), 'es'), (('es', 't'), 'est'), (('est', '</w>'), 'est</w>')]
In practice, you'll typically use established libraries. Here's how you might use a pre-trained BPE tokenizer (like GPT-2's) with the Hugging Face transformers
library in a PyTorch context:
import torch
from transformers import GPT2Tokenizer
# Load a pre-trained BPE tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "lower newest widest"
encoded_input = tokenizer(text, return_tensors='pt') # pt for PyTorch tensors
print("Input Text:", text)
print("Token IDs:", encoded_input['input_ids'])
# Example output (IDs shown are illustrative): tensor([[ 3224, 29976, 23564]])
# Note 'lower' is one token, 'newest' is one, 'widest' is one in GPT-2's vocab
print(
"Decoded Tokens:",
tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
)
# Example Output: ['lower', 'Ġnewest', 'Ġwidest']
# The 'Ġ' character (U+0120) indicates the start of a
# word/token that was preceded by a space.
# Handling potentially unknown combinations (though BPE handles OOV chars/bytes)
text_oov = "supercalifragilisticexpialidocious"
encoded_oov = tokenizer(text_oov, return_tensors='pt')
print("\nOOV-like Text:", text_oov)
print("Token IDs:", encoded_oov['input_ids'])
print(
"Decoded Tokens:",
tokenizer.convert_ids_to_tokens(encoded_oov['input_ids'][0])
)
# Example Output: ['super', 'cal', 'if', 'rag', 'il',
# 'istic', 'exp', 'ial', 'id', 'ocious']
# The word is broken down into known subword units from the GPT-2 vocabulary.
This code snippet demonstrates how a trained BPE tokenizer segments text into subword units represented by numerical IDs, ready for input into a model like a Transformer. It also shows the OOV handling where an unseen word is broken down into recognizable pieces.
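The mapping also runs in reverse: token IDs can be decoded back into text. Continuing with the tokenizer and encoded_input objects from the snippet above:

# Decode the token IDs back into the original string
print(tokenizer.decode(encoded_input['input_ids'][0]))
# lower newest widest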
The number of merge operations performed during BPE training directly determines the final vocabulary size. This presents a trade-off:
- More merges (larger vocabulary): frequent words survive as single tokens and sequences are shorter, but the embedding matrix grows and rare subwords receive fewer training updates.
- Fewer merges (smaller vocabulary): fewer embedding parameters and each token is seen more often during training, but words are split into more pieces, producing longer sequences.
Choosing the right vocabulary size is an empirical process, often guided by the scale of the model, the characteristics of the training data, and downstream task performance. BPE provides the mechanism to control this balance.
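In practice this knob is exposed directly when training a tokenizer. For example, the Hugging Face tokenizers library accepts a target vocabulary size; the file name and vocab_size value below are placeholders:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE tokenizer with an explicit vocabulary-size budget
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder path

print(tokenizer.get_vocab_size())                 # roughly the requested 8000
print(tokenizer.encode("lower newest widest").tokens)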
In summary, BPE is a powerful and widely used subword tokenization technique. By learning to merge frequent character or byte pairs from a large corpus, it creates a subword vocabulary that represents diverse text compactly, avoids OOV issues, and allows control over the balance between vocabulary size and sequence length.