While Byte Pair Encoding (BPE) merges the most frequent pairs of tokens, WordPiece takes a different, yet related, approach. Developed at Google and notably used for models like BERT (Bidirectional Encoder Representations from Transformers) and its variants, WordPiece also builds a vocabulary iteratively by merging units. However, its merge criterion is based on maximizing the likelihood of the training data, rather than raw frequency counts.
The process starts similarly to BPE: initialize the vocabulary with all individual characters present in the training data. Then, iteratively consider merging adjacent tokens. The core difference lies in selecting which pair to merge. WordPiece chooses the pair (say, A and B) such that merging them into a single token AB results in the largest increase in the likelihood of the training corpus, assuming a unigram language model over the tokens.
Imagine the training corpus as a sequence of tokens generated from our current vocabulary V. The likelihood of the corpus L is the product of the probabilities of each token occurring in the sequence:
$$L = \prod_{\text{token} \in \text{Corpus}} P(\text{token})$$

The probability P(token) is typically estimated as its frequency in the corpus divided by the total number of tokens:
$$P(\text{token}) = \frac{\text{count}(\text{token})}{\sum_{t \in V} \text{count}(t)}$$

When we consider merging two tokens, A and B, into a new token AB, we effectively modify the token sequence wherever A and B appear adjacently. This changes the counts of A, B, and AB, and also the total number of tokens in the corpus (since each merge reduces the token count by one). WordPiece evaluates this change for all possible adjacent pairs in the current tokenization of the corpus. The pair (A, B) whose merge results in the highest likelihood L for the modified corpus is chosen and added to the vocabulary for the next iteration.
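To make this concrete, here is a minimal sketch that computes the unigram log-likelihood of a toy tokenized corpus before and after a single merge. The corpus, the chosen pair, and the helper names are illustrative assumptions, not part of any production WordPiece implementation.

import math
from collections import Counter

def log_likelihood(tokens):
    # Unigram log-likelihood: sum of log(count(token) / total tokens)
    counts = Counter(tokens)
    total = len(tokens)
    return sum(math.log(counts[t] / total) for t in tokens)

def merge_pair(tokens, a, b):
    # Replace every adjacent occurrence of (a, b) with the merged token a + b
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus already tokenized with the current vocabulary
corpus = ["h", "u", "g", "h", "u", "g", "s", "p", "u", "g"]

before = log_likelihood(corpus)
after = log_likelihood(merge_pair(corpus, "u", "g"))
print(f"Log-likelihood before merge: {before:.3f}")
print(f"Log-likelihood after merging ('u', 'g'): {after:.3f}")

Running this shows the log-likelihood increasing after the merge, which is exactly the signal WordPiece uses to rank candidate merges.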
In practice, calculating the full likelihood change can be complex. Often, approximations or scores related to likelihood are used. A common way to think about this is selecting the merge (A,B)→AB that maximizes a score like:
$$\text{score}(A, B) = \frac{\text{count}(AB)}{\text{count}(A) \times \text{count}(B)}$$

While this isn't exactly the likelihood change, it captures a similar intuition: merging pairs that frequently occur together relative to their individual frequencies is beneficial. The exact scoring function can vary between implementations.
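As a rough sketch of how one merge step might use this score, the snippet below counts tokens and adjacent pairs in a toy corpus and picks the highest-scoring pair. The corpus and the best_merge helper are illustrative assumptions; real trainers add many optimizations on top of this idea.

from collections import Counter

def best_merge(tokenized_words):
    # tokenized_words: each word represented as a list of current tokens
    token_counts = Counter(t for word in tokenized_words for t in word)
    pair_counts = Counter()
    for word in tokenized_words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
    # score(A, B) = count(AB) / (count(A) * count(B))
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (token_counts[a] * token_counts[b])
    return max(pair_counts, key=score)

# Hypothetical corpus: words pre-split into characters
words = [list("hugging"), list("hug"), list("hugs"), list("pug")]
print(best_merge(words))  # ('i', 'n') here: 'i' and 'n' only ever appear together

Notice that the winning pair is not the most frequent one; it is the pair whose members co-occur most exclusively, which is the behavior that distinguishes this criterion from plain BPE frequency counting.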
Like many BPE implementations, WordPiece typically handles subwords within a word by adding a special prefix (commonly ##) to tokens that represent continuations of a word. The initial word split might occur at whitespace, and then WordPiece tokenization proceeds within those initial word units.
For example, the word "hugging" might initially be split into characters: h, u, g, g, i, n, g. Through iterative merges based on likelihood maximization, the vocabulary might eventually contain hug and ##ging. The word "hugging" would then be tokenized as ['hug', '##ging']. The ## prefix indicates that ging is attached to the preceding token without a space. This allows the model to differentiate between a token starting a word and one occurring mid-word, and enables unambiguous detokenization.
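The detokenization rule itself is simple enough to write by hand. The helper below is a hypothetical sketch of that rule, not a function from any library:

def wordpiece_detokenize(tokens):
    # Glue any token starting with '##' onto the previous token; otherwise start a new word
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(wordpiece_detokenize(["hug", "##ging"]))             # hugging
print(wordpiece_detokenize(["word", "##piece", "rocks"]))  # wordpiece rocks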
Let's see how to use a pre-trained WordPiece tokenizer, specifically the one used for BERT, via the Hugging Face transformers library in PyTorch.
# Make sure you have transformers and torch installed
# pip install transformers torch
from transformers import BertTokenizer
# Load a pre-trained BERT tokenizer (which uses WordPiece)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "WordPiece maximizes likelihood."
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
# Convert tokens to their corresponding IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# Get the full output including special tokens ([CLS], [SEP])
encoded_input = tokenizer(sentence)
print(f"Original Sentence: {sentence}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {input_ids}")
print(f"Full Encoded Input (with special tokens):")
print(f"{encoded_input['input_ids']}")
print(f"Decoded Full Input: {tokenizer.decode(encoded_input['input_ids'])}")
Running this code produces output similar to this:
Original Sentence: WordPiece maximizes likelihood.
Tokens: ['word', '##piece', 'max', '##imi', '##zes', 'like', '##li', '##hood', '.']
Token IDs: [2773, 19352, 4011, 24027, 16464, 2066, 2135, 12731, 1012]
Full Encoded Input (with special tokens):
[101, 2773, 19352, 4011, 24027, 16464, 2066, 2135, 12731, 1012, 102]
Decoded Full Input: [CLS] wordpiece maximizes likelihood. [SEP]
Notice how "WordPiece" is split into word and ##piece, "maximizes" into max, ##imi, and ##zes, and "likelihood" into like, ##li, and ##hood. The ## prefix clearly marks the subword units that don't start a word. The final encoded_input['input_ids'] includes the special [CLS] token ID (101) at the beginning and the [SEP] token ID (102) at the end, which are standard requirements for BERT-style models.
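The transformers tokenizer applies the same ## convention when mapping tokens back to text. Continuing from the code above, convert_tokens_to_string rejoins the pieces (the exact punctuation spacing may differ slightly from the original sentence):

# Map the WordPiece tokens back to a string; '##' continuations are glued back on
print(tokenizer.convert_tokens_to_string(tokens))
# Expected output (roughly): wordpiece maximizes likelihood .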
The primary distinction between WordPiece and BPE is the criterion for merging tokens: BPE uses frequency, while WordPiece uses corpus likelihood maximization (or a related score). This difference can lead to variations in the final vocabulary and how words are segmented. WordPiece might favor merges that create statistically more probable units, even if they aren't strictly the most frequent pair count-wise.
Figure: Difference in merge criteria between BPE and WordPiece.
In practice, both methods are effective at creating subword vocabularies that handle large text corpora well, mitigating the OOV problem and keeping vocabulary sizes manageable. The choice between them often depends on the specific model architecture they were originally developed for (e.g., BERT uses WordPiece, GPT-2 uses BPE) or on empirical performance for a specific task and dataset. Implementations are readily available, particularly within frameworks like Hugging Face's tokenizers library, which often abstract away the subtle differences for the end user. Understanding the underlying mechanism, however, helps in appreciating how raw text is transformed into the numerical sequences processed by LLMs.
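If you need a WordPiece tokenizer trained on your own corpus rather than a pre-trained one, the tokenizers library provides a trainer for it. The sketch below is a minimal example; the tiny in-memory corpus, vocabulary size, and special tokens are placeholder choices for illustration.

# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an untrained WordPiece model with whitespace pre-tokenization
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Tiny illustrative corpus; in practice you would train on large text files
corpus = [
    "wordpiece maximizes likelihood",
    "hugging face builds tokenizers",
    "subword tokenization handles rare words",
]

trainer = WordPieceTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("wordpiece maximizes likelihood").tokens)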