As we've established, perplexity is a fundamental intrinsic metric for evaluating language models: it quantifies how well a model predicts a given sequence of text and is derived directly from the probabilities the model assigns to the tokens in that sequence. A point that is often overlooked, however, is that the calculated perplexity is highly sensitive to the specific tokenization scheme used. Recall the perplexity formula:
$$\text{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i}; \theta)\right)$$
Here, $w_i$ represents the $i$-th token in the sequence, and $N$ is the total number of tokens. This means that both the individual probabilities $p(w_i \mid w_{<i}; \theta)$ and the sequence length $N$ are direct functions of how the raw text is segmented into tokens.
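To make that dependence concrete, here is a minimal sketch that plugs per-token probabilities directly into the formula. The probability values are invented purely for illustration; the point is that both the values and the number of terms change with the segmentation.

import math

# Hypothetical per-token probabilities for the same sentence under two
# segmentations (illustrative numbers, not real model output).
word_level_probs = [0.02, 0.15, 0.08, 0.50]   # N = 4 tokens
char_level_probs = [0.30] * 32                # N = 32 tokens

def perplexity(token_probs):
    # exp of the average negative log-probability per token
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

print(f"Word-level PPL: {perplexity(word_level_probs):.2f}")
print(f"Char-level PPL: {perplexity(char_level_probs):.2f}")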
Consider a simple sentence: "Tokenization impacts perplexity."

Let's see how different tokenizers might handle this:

- Word-level: ["Tokenization", "impacts", "perplexity", "."] -> N = 4 tokens. The model predicts the probability of "impacts" given "Tokenization", then "perplexity" given the first two, and so on.
- Character-level: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 'm', 'p', 'a', 'c', 't', 's', ' ', 'p', 'e', 'r', 'p', 'l', 'e', 'x', 'i', 't', 'y', '.'] -> N = 32 tokens. The model predicts 'o' given 'T', 'k' given 'To', and so on. The prediction task is very different.
- Subword: ["Token", "ization", "Ġimpacts", "Ġperplex", "ity", "."] -> N = 6 tokens (assuming a GPT-2 style tokenizer where Ġ indicates a preceding space). Here, the model predicts "ization" given "Token", "Ġimpacts" given "Tokenization", and so forth.

This example highlights two main effects:

- The sequence length $N$ differs, which changes the denominator of the averaged log-probability.
- The prediction targets themselves differ, so the individual probabilities $p(w_i \mid w_{<i}; \theta)$ come from a different per-token prediction task.
Let's illustrate the difference in token counts with a brief Python example using the Hugging Face transformers library:

from transformers import AutoTokenizer
# Load two different tokenizers
tokenizer_bert = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer_gpt2 = AutoTokenizer.from_pretrained('gpt2')
text = "Tokenization impacts perplexity."
# Tokenize with BERT tokenizer (WordPiece)
tokens_bert = tokenizer_bert.tokenize(text)
ids_bert = tokenizer_bert.encode(text)
print(f"BERT Tokens ({len(tokens_bert)}): {tokens_bert}")
# Output: BERT Tokens (7): ['token', '##ization', 'impacts', 'per',
# '##plex', '##ity', '.']
print(f"BERT IDs ({len(ids_bert)}): {ids_bert}")
# Output: BERT IDs (9): [101, 19204, 17260, 7296, 2361, 18049, 4234,
# 1012, 102]
# Note: Includes [CLS] and [SEP] tokens
# Tokenize with GPT-2 tokenizer (BPE)
tokens_gpt2 = tokenizer_gpt2.tokenize(text)
ids_gpt2 = tokenizer_gpt2.encode(text)
print(f"GPT-2 Tokens ({len(tokens_gpt2)}): {tokens_gpt2}")
# Output: GPT-2 Tokens (6): ['Token', 'ization', 'Ġimpacts', 'Ġperplex',
# 'ity', '.']
print(f"GPT-2 IDs ({len(ids_gpt2)}): {ids_gpt2}")
# Output: GPT-2 IDs (6): [11934, 10004, 33333, 21119, 2138, 13]
# Note: GPT-2 tokenizer doesn't add special tokens by default here,
# length matches tokens
Notice that even between two subword tokenizers (BERT's WordPiece and GPT-2's BPE), the segmentation differs ('token', '##ization' vs. 'Token', 'ization'), and the inclusion of special tokens like [CLS] and [SEP] by default in BERT's encode method affects the sequence length $N$ used in a standard perplexity calculation.
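To connect these token counts to an actual perplexity value, here is a minimal sketch that scores the same sentence with the pretrained gpt2 checkpoint (chosen purely for illustration). When labels are supplied, the model returns the mean cross-entropy over its own BPE tokens, so exponentiating that loss yields a perplexity tied to this specific segmentation.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Score one sentence under GPT-2's own tokenization.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

text = "Tokenization impacts perplexity."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    # Passing input_ids as labels makes the model return the average
    # negative log-likelihood (in nats) over the predicted positions.
    outputs = model(**inputs, labels=inputs['input_ids'])

ppl = torch.exp(outputs.loss)
print(f"N = {inputs['input_ids'].shape[1]} tokens, perplexity = {ppl.item():.2f}")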
The direct consequence is that perplexity scores are only directly comparable between models if they use the exact same tokenizer and vocabulary on the same evaluation dataset. Comparing the perplexity of a model using BPE with 50,000 merges to one using WordPiece with a 30,000-entry vocabulary is comparing apples and oranges: the underlying units of prediction are different.
For the same underlying text, perplexity scores can vary dramatically based solely on the chosen tokenizer. A lower value suggests the model finds the per-token prediction task easier, but it does not permit a direct comparison of model quality across different tokenizations.
Furthermore, preprocessing steps applied before tokenization, such as lowercasing or Unicode normalization, also interact with this. If one evaluation lowercases the text and another does not, a case-sensitive tokenizer will produce different tokens, leading to incomparable perplexity scores.
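As a quick way to observe that interaction, the snippet below tokenizes the same sentence with and without lowercasing using the case-sensitive GPT-2 tokenizer; the exact splits depend on the vocabulary, but the two token sequences will differ.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # case-sensitive BPE

text = "Tokenization impacts perplexity."
original = tokenizer.tokenize(text)
lowercased = tokenizer.tokenize(text.lower())

print(f"Original   ({len(original)}): {original}")
print(f"Lowercased ({len(lowercased)}): {lowercased}")
# The two runs segment the text differently, so perplexity computed over
# one is not directly comparable to perplexity computed over the other.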
When reporting or interpreting perplexity, always ensure you know exactly which tokenizer was used, including its vocabulary size and any associated preprocessing steps. Without this context, a raw perplexity number offers limited insight into the model's relative capabilities compared to others evaluated under different tokenization regimes. The most reliable comparisons are made when evaluating different models or checkpoints using an identical evaluation setup, including the tokenizer.
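In practice, it helps to record this context alongside any reported score. One simple convention, sketched here with standard transformers tokenizer attributes, is to log the tokenizer name, vocabulary size, and special tokens used during evaluation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Context a reported perplexity depends on.
eval_metadata = {
    'tokenizer': tokenizer.name_or_path,        # e.g. 'gpt2'
    'vocab_size': len(tokenizer),               # includes any added special tokens
    'special_tokens': tokenizer.all_special_tokens,
}
print(eval_metadata)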