While perplexity, as discussed previously, is the most common intrinsic metric for language models, closely related metrics grounded in information theory offer alternative perspectives, particularly when comparing models with different tokenization schemes. These metrics measure the average number of bits required to encode each unit of text (character or word/token) according to the model's probability distribution.
Recall that the cross-entropy loss used during training is essentially the average negative log-likelihood of the true next token, typically calculated using the natural logarithm (base e):

$$H(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \log_e q(w_i \mid w_{<i}; \theta)$$

where $q$ is the model's distribution and $N$ is the number of tokens. Perplexity is the exponentiation of this loss: $\text{PPL} = \exp(H(p, q))$.
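To make this concrete, here is a minimal PyTorch sketch that computes the average per-token cross-entropy and the corresponding perplexity. The logits and targets are random placeholders standing in for a real evaluation batch; in practice they would come from your model and dataset.
import torch
import torch.nn.functional as F
# Random placeholders standing in for one evaluation batch:
# 4 sequences, 32 positions each, vocabulary of 1000 tokens.
vocab_size = 1000
logits = torch.randn(4, 32, vocab_size)
targets = torch.randint(0, vocab_size, (4, 32))
# Average negative log-likelihood per token, natural log base (H_e)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
# Perplexity is the exponentiation of this loss
perplexity = torch.exp(loss)
print(f"Cross-Entropy Loss (base e) per Token: {loss.item():.4f}")
print(f"Perplexity (PPL): {perplexity.item():.4f}")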
Information theory, however, often measures information content in bits (base 2 logarithm). A model that assigns higher probabilities to the observed sequence is more "certain" and requires fewer bits, on average, to encode that sequence according to its probability assignments. This leads us to metrics like Bits Per Character and Bits Per Word/Token.
Bits Per Character (BPC) measures the average number of bits needed to encode each character in a text sequence, given the model's predictions. It is calculated by taking the cross-entropy loss, computed using the base 2 logarithm, and normalizing by the total number of characters (C) in the sequence, not tokens:
$$\text{BPC} = -\frac{1}{C} \sum_{j=1}^{C} \log_2 p(\text{char}_j \mid \text{context}; \theta)$$
However, most large language models operate on tokens (words or subwords), not individual characters. Calculating the exact probability of each character requires a character-level model or complex marginalization over token probabilities. A more practical approach for token-based models is to calculate the standard cross-entropy loss per token (base e) and then convert it to bits (base 2) and normalize by the number of characters in the original text.
The relationship between the per-token cross-entropy loss (base e, denoted $H_e$) and BPC is:

$$\text{BPC} = \frac{H_e \times N}{C \times \ln(2)}$$

Here, $N$ is the number of tokens and $C$ is the number of characters in the evaluation text. The $\ln(2)$ factor converts the natural logarithm to the base-2 logarithm, since $\log_2(x) = \frac{\ln(x)}{\ln(2)}$.
The primary advantage of BPC is its relative insensitivity to the chosen tokenization method. Since it's normalized by character count, you can more fairly compare models that use different tokenizers (e.g., BPE vs. WordPiece vs. Character-level) on the same raw text data. A model might achieve a good perplexity score simply by having a tokenizer that breaks words into many small pieces, but its BPC might reveal if it's truly better at modeling the underlying character sequence.
Let's put this into code. Assuming you have the average cross-entropy loss per token (base e) from your evaluation loop, along with the character and token counts:
import math
# Assume these values are obtained from evaluation
avg_cross_entropy_loss_per_token = 1.85 # Example loss (natural log base)
total_tokens_in_eval_set = 50000 # Example token count
total_chars_in_eval_set = 200000 # Example character count
# Calculate cross-entropy in base 2, normalized by tokens
avg_bits_per_token = avg_cross_entropy_loss_per_token / math.log(2)
# Calculate total bits for the dataset according to the model
total_bits = avg_bits_per_token * total_tokens_in_eval_set
# Calculate Bits Per Character (BPC)
bpc = total_bits / total_chars_in_eval_set
print(
f"Average Cross-Entropy Loss (base e) per Token: "
f"{avg_cross_entropy_loss_per_token:.4f}"
)
print(f"Average Bits Per Token (BPT): {avg_bits_per_token:.4f}")
print(f"Bits Per Character (BPC): {bpc:.4f}")
# Example Output:
# Average Cross-Entropy Loss (base e) per Token: 1.8500
# Average Bits Per Token (BPT): 2.6690
# Bits Per Character (BPC): 0.6672
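To see why the character normalization matters, consider the hypothetical comparison below: two models evaluated on the same 200,000-character text, one with a fine-grained tokenizer that produces many short tokens and one with a coarser tokenizer. All loss and count values are illustrative, not real measurements. The per-token numbers (PPL, BPT) are not directly comparable between the two, but BPC is computed against the same character count for both.
import math
# Illustrative numbers for two models evaluated on the same text
total_chars = 200_000
models = {
    "fine_grained_tokenizer": {"loss_e": 1.40, "tokens": 80_000},  # many short tokens
    "coarse_tokenizer": {"loss_e": 2.10, "tokens": 50_000},        # fewer, longer tokens
}
for name, m in models.items():
    ppl = math.exp(m["loss_e"])                                      # perplexity over tokens
    bpt = m["loss_e"] / math.log(2)                                  # bits per token
    bpc = (m["loss_e"] * m["tokens"]) / (total_chars * math.log(2))  # bits per character
    print(f"{name}: PPL={ppl:.2f}  BPT={bpt:.3f}  BPC={bpc:.4f}")
# Example Output:
# fine_grained_tokenizer: PPL=4.06  BPT=2.020  BPC=0.8079
# coarse_tokenizer: PPL=8.17  BPT=3.030  BPC=0.7574
In this made-up comparison the fine-grained model looks far better on perplexity and BPT, yet its BPC is worse. That is exactly the kind of tokenizer-induced distortion BPC is designed to expose.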
Bits Per Word (BPW) or Bits Per Token (BPT) is simpler to compute for token-based models. It represents the average number of bits required to encode each token (or word, if using word-level tokenization) based on the model's probability assignments.
It is directly related to the cross-entropy loss (base e, $H_e$) and perplexity (PPL) calculated over tokens:

$$\text{BPT} = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(\text{token}_i \mid \text{token}_{<i}; \theta) = \frac{H_e}{\ln(2)}$$
Furthermore, BPW/BPT has a direct relationship with perplexity:
$$\text{PPL} = 2^{\text{BPT}} \qquad \text{BPT} = \log_2(\text{PPL})$$
This means BPW/BPT is just the base-2 logarithm of the perplexity. If a model has a perplexity of 16 on a dataset, it means that, on average, predicting the next token is as difficult as choosing uniformly among $16 = 2^4$ options. This corresponds to a BPT of 4 bits per token.
import math
# Example values
perplexity = 16.0
avg_cross_entropy_loss_per_token = math.log(perplexity) # H_e = ln(PPL)
# Calculate BPT from perplexity
bpt_from_ppl = math.log2(perplexity)
# Calculate BPT from cross-entropy loss (base e)
bpt_from_loss = avg_cross_entropy_loss_per_token / math.log(2)
print(f"Perplexity (PPL): {perplexity:.4f}")
print(
f"Average Cross-Entropy Loss (base e) per Token: "
f"{avg_cross_entropy_loss_per_token:.4f}"
)
print(f"Bits Per Token (BPT) from PPL: {bpt_from_ppl:.4f}")
print(f"Bits Per Token (BPT) from Loss: {bpt_from_loss:.4f}")
# Example Output:
# Perplexity (PPL): 16.0000
# Average Cross-Entropy Loss (base e) per Token: 2.7726
# Bits Per Token (BPT) from PPL: 4.0000
# Bits Per Token (BPT) from Loss: 4.0000
While BPW/BPT is a straightforward transformation of perplexity, thinking in terms of "bits" provides a stronger connection to information theory and compression limits. A lower BPT indicates a model that is more efficient at compressing the information in the text sequence according to its learned probability distribution. However, unlike BPC, BPW/BPT is highly dependent on the tokenization used. A model using subword units will typically have a higher BPT than a character-level model on the same text, simply because each token covers more characters at once and therefore carries more information per prediction.
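Because BPT and BPC measure the same total number of bits, just normalized by different counts, they are linked by the average number of characters per token. The short sketch below makes the conversion explicit, reusing the illustrative counts from the BPC example above.
import math
# Illustrative counts; in practice these come from your evaluation set
avg_cross_entropy_loss_per_token = 1.85  # base e
total_tokens_in_eval_set = 50_000
total_chars_in_eval_set = 200_000
bpt = avg_cross_entropy_loss_per_token / math.log(2)   # bits per token
chars_per_token = total_chars_in_eval_set / total_tokens_in_eval_set
bpc = bpt / chars_per_token                             # same bits, spread over characters
print(f"Bits Per Token (BPT): {bpt:.4f}")
print(f"Average Characters per Token: {chars_per_token:.2f}")
print(f"Bits Per Character (BPC): {bpc:.4f}")
# Example Output:
# Bits Per Token (BPT): 2.6690
# Average Characters per Token: 4.00
# Bits Per Character (BPC): 0.6672
This matches the BPC computed earlier and makes the dependence clear: for a fixed BPC, a tokenizer that packs more characters into each token yields a proportionally higher BPT.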
In summary, both BPC and BPW/BPT offer information-theoretic views on intrinsic model performance. BPC provides a tokenizer-agnostic measure useful for comparing fundamentally different models, while BPW/BPT is a direct logarithmic transformation of perplexity, commonly used for evaluating token-based models within their specific tokenization scheme. Both metrics quantify how well the model's probability distribution matches the data, with lower values indicating a better fit.