While basic tokenization methods, like splitting text based on whitespace or punctuation, are fundamental first steps, they often fall short when dealing with the complexities of real-world language. Consider the challenges: vocabularies grow very large, out-of-vocabulary (OOV) words cannot be represented at all, and morphologically related forms such as "run" and "running" end up as completely unrelated tokens.
To address these limitations, more sophisticated tokenization strategies based on subword units have become standard practice, especially in modern deep learning models for NLP. Instead of treating whole words as the basic unit, subword tokenization breaks words into smaller, meaningful pieces. This approach cleverly balances the granularity of character-level tokens with the semantic representation of word-level tokens. Two prominent techniques are Byte-Pair Encoding (BPE) and WordPiece.
Originally developed as a data compression algorithm, Byte-Pair Encoding (BPE) was adapted for NLP tokenization to manage vocabulary size effectively while handling word variations. The core idea is the iterative merging of the most frequent pairs of adjacent symbols, starting from individual characters.
How BPE Works:

1. Start with a base vocabulary of all individual characters in the corpus, appending a special end-of-word symbol (e.g., `</w>` or `</s>`) to the end of each word in the corpus to distinguish word boundaries.
2. Count the frequency of every pair of adjacent symbols in the corpus.
3. Find the most frequent pair (e.g., 't' followed by 'h'). Merge this pair into a single new symbol (e.g., 'th') and add it to the vocabulary.
4. Replace all occurrences of that pair in the corpus with the new symbol, then repeat from step 2 until a target number of merges or a target vocabulary size is reached.

Example:
Let's trace a simplified BPE process on the corpus "hug hugger huge hugs", adding `</w>` to mark word ends:

Initial state:
- Corpus as symbol sequences: `h u g </w> h u g g e r </w> h u g e </w> h u g s </w>`
- Vocabulary: `{h, u, g, </w>, e, r, s}`
- Symbol frequencies: `h:4, u:4, g:5, </w>:4, e:2, r:1, s:1`

Merge 1:
- Pair frequencies: `(h u):4, (u g):4, (g </w>):1, (g g):1, (g e):2, (e r):1, (r </w>):1, (e </w>):1, (g s):1, (s </w>):1`
- `(h u)` and `(u g)` are most frequent (4 times each). Let's merge `u` and `g` into `ug`.
- Vocabulary: `{h, u, g, </w>, e, r, s, ug}`
- Corpus: `h ug </w> h ug g e r </w> h ug e </w> h ug s </w>`

Merge 2:
- `(h ug)` occurs 4 times. Merge `h` and `ug` into `hug`.
- Vocabulary: `{h, u, g, </w>, e, r, s, ug, hug}`
- Corpus: `hug </w> hug g e r </w> hug e </w> hug s </w>`
Merge 3:
- Pair frequencies: `(hug </w>):1, (hug g):1, (g e):1, (e r):1, (r </w>):1, (hug e):1, (e </w>):1, (hug s):1, (s </w>):1`
- On this tiny corpus every remaining pair now occurs only once, so the next merge is decided by the tie-breaking rule. Suppose we merge `g` and `e` into `ge`.
- Vocabulary: `{h, u, g, </w>, e, r, s, ug, hug, ge}`
- Corpus: `hug </w> hug ge r </w> hug e </w> hug s </w>`

This process continues for a predetermined number of merges or until a target vocabulary size is reached. On a realistic corpus, frequent sequences like "er" or "ing" would also be merged. Eventually, the vocabulary contains the individual characters plus frequent subword units such as `ug` and `hug`.
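To make the training loop concrete, here is a minimal, illustrative sketch of BPE merge learning in Python. The function names (`learn_bpe`, `get_pair_counts`, `merge_pair`) are invented for this example, ties between equally frequent pairs are broken arbitrarily (so the exact merge order may differ from the trace above), and real tokenizer libraries use far more optimized implementations.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, corpus):
    """Rewrite the corpus with every adjacent occurrence of `pair` merged into one symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus


def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules; ties between equally frequent pairs break arbitrarily."""
    merges = []
    for _ in range(num_merges):
        pair_counts = get_pair_counts(corpus)
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(best, corpus)
    return merges, corpus


# Each word is written as space-separated symbols ending with the </w> marker.
corpus = {
    "h u g </w>": 1,
    "h u g g e r </w>": 1,
    "h u g e </w>": 1,
    "h u g s </w>": 1,
}

merges, final_corpus = learn_bpe(corpus, num_merges=3)
print(merges)             # e.g., [('h', 'u'), ('hu', 'g'), ...] depending on tie-breaking
print(list(final_corpus))
```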
Tokenizing New Text: To tokenize a new word (e.g., "hugging"), BPE applies the learned merge rules greedily. It first breaks the word into characters plus the end-of-word marker: `h, u, g, g, i, n, g, </w>`. It then applies the learned merges in the order they were learned. If `ug` was learned, then `hug`, then `ing` (hypothetically), the result might be `['hug', 'g', 'ing', '</w>']`. An unknown word like "snuggles" might be broken down into `['s', 'n', 'ug', 'gle', 's', '</w>']` if those subwords exist in the vocabulary, effectively handling the OOV problem by representing the word with known parts.
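The greedy application of learned merges can be sketched as follows. The merge list below is hypothetical (mirroring the example in the text), and the single left-to-right pass per rule is a simplification of what production BPE tokenizers such as GPT-2's actually do (they select merges by rank and operate on bytes).

```python
def apply_bpe(word, merges):
    """Tokenize a word by replaying learned merges in the order they were learned."""
    symbols = list(word) + ["</w>"]          # start from characters plus the end-of-word marker
    for a, b in merges:                      # one left-to-right pass per merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed in the order they were learned.
merges = [("u", "g"), ("h", "ug"), ("i", "n"), ("in", "g"), ("ing", "</w>")]
print(apply_bpe("hugging", merges))  # ['hug', 'g', 'ing</w>'] with these particular rules
```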
BPE is used by models like OpenAI's GPT series.
WordPiece is another popular subword tokenization algorithm, conceptually similar to BPE but differing in its merge criterion. It is used by models like Google's BERT.
How WordPiece Works:

- Training: Like BPE, WordPiece starts from a character-level vocabulary and iteratively adds merged symbols. Instead of merging the most frequent pair, however, it merges the pair that most increases the likelihood of the training data under the current vocabulary.
- Tokenization Strategy: When tokenizing new text, WordPiece attempts to find the longest possible subword in its vocabulary that matches the beginning of the current word. If a subword unit does not represent the start of an original word, it is prefixed with a special symbol, typically `##` (e.g., "tokenization" might become `['token', '##ization']`). This explicit marking tells the model that `##ization` is a continuation of a previous piece.
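The matching step (not WordPiece training itself) can be sketched with a greedy longest-match loop like the one below; the tiny vocabulary is made up purely for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first tokenization, in the style of WordPiece."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:                       # try the longest remaining substring first
            piece = word[start:end]
            if start > 0:                        # continuation pieces carry the ## prefix
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                        # no piece matches: treat the word as unknown
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# A made-up vocabulary for illustration.
vocab = {"token", "##ization", "##ize", "##s", "hug", "##ging"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("hugging", vocab))       # ['hug', '##ging']
```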
Comparison:
| Feature | BPE | WordPiece |
|---|---|---|
| Merge Criterion | Highest frequency of adjacent pair | Highest likelihood increase for the data |
| Implementation | Simpler, based purely on counts | More complex, involves likelihood calculation |
| Subword Marking | Usually uses end-of-word symbols (`</w>`) | Often uses prefix markers (`##`) for continuations |
| Common Models | GPT, RoBERTa | BERT, DistilBERT |
Both BPE and WordPiece effectively reduce vocabulary size compared to word-level tokens, handle OOV words by breaking them down into known subwords, and capture morphological relationships (e.g., "run", "running" likely share the "run" subword).
Subword tokenization offers a middle ground, providing a manageable vocabulary size while retaining more semantic information than character-level tokens and handling OOV words better than word-level tokens.
It's also worth mentioning SentencePiece, developed by Google. It treats the input text as a raw stream of Unicode characters, including whitespace. This means it can learn to tokenize text without relying on pre-tokenization (like splitting by spaces) and can handle multiple languages more easily. SentencePiece can be configured to use either BPE or a unigram language modeling approach for building its subword vocabulary. A significant advantage is its fully reversible tokenization: the original text can be perfectly reconstructed from the tokens because whitespace is treated as part of the subword units (often represented by a meta-symbol such as ▁).
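If you want to try it, the `sentencepiece` Python package exposes both training and encoding. The snippet below is a small sketch that assumes you have a plain-text training file; the path `corpus.txt` and the tiny `vocab_size` are placeholders chosen for illustration.

```python
import sentencepiece as spm

# Train a small BPE-based SentencePiece model on a plain-text corpus.
# "corpus.txt" is a placeholder path; vocab_size is deliberately small for this sketch.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="demo_sp",
    vocab_size=1000,
    model_type="bpe",       # could also be "unigram" (the default)
)

# Load the trained model and tokenize. Whitespace is encoded with the ▁ meta-symbol,
# so decoding the pieces reconstructs the original text exactly.
sp = spm.SentencePieceProcessor(model_file="demo_sp.model")
pieces = sp.encode("Tokenization handles new words.", out_type=str)
print(pieces)            # e.g., ['▁Token', 'ization', '▁handles', '▁new', '▁words', '.']
print(sp.decode(pieces)) # 'Tokenization handles new words.'
```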
The choice between BPE, WordPiece, or SentencePiece often depends on the specific pre-trained model you intend to use, as models are typically released with their corresponding trained tokenizers. Libraries like Hugging Face's `transformers` abstract away many of the implementation details, allowing you to easily load and use the appropriate tokenizer for a given model (e.g., `BertTokenizer` uses WordPiece, `GPT2Tokenizer` uses BPE).
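For example, assuming `transformers` is installed and the model files can be downloaded, you can load each model's tokenizer and inspect how it splits a word:

```python
from transformers import AutoTokenizer

# BERT ships with a WordPiece tokenizer; continuation pieces carry the ## prefix.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("tokenization"))   # ['token', '##ization']

# GPT-2 ships with a byte-level BPE tokenizer; tokens beginning with 'Ġ' follow a space.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tok.tokenize("Tokenization handles new words."))
```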
Training these tokenizers requires large amounts of text data representative of the domain you're working in. However, for many applications, using the pre-trained tokenizer associated with a pre-trained language model is the most practical approach.
Understanding these advanced tokenization methods is important because the way text is broken down fundamentally influences how downstream models interpret and process language data. They represent a significant step beyond simple splitting, enabling modern NLP models to handle diverse vocabulary and morphology more effectively.