As introduced earlier, intrinsic evaluation assesses a language model's core capability: predicting the next token in a sequence. We don't need a specific downstream task for this; we directly measure how well the model understands the statistical patterns of the language it was trained on, using a held-out test dataset. The standard metric for this is perplexity.
Perplexity (PPL) quantifies how uncertain a language model is when predicting the next token in a sequence. Think of it as the effective number of choices the model has for the next token, averaged over the sequence. A lower perplexity indicates that the model is more confident and accurate in its predictions for the given test data. It suggests the model assigns higher probabilities to the actual tokens that appear in the test set.
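To make this "effective number of choices" intuition concrete, here is a minimal sketch with a made-up probability: a model that assigns probability 0.25 to every correct token has a perplexity of 4, as if it were choosing uniformly among four tokens at each step.

import math

# Hypothetical case: the model assigns probability 0.25 to each correct token.
p_correct = 0.25
perplexity = math.exp(-math.log(p_correct))  # equivalent to 1 / p_correct
print(f"Perplexity: {perplexity:.2f}")  # 4.00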
Mathematically, for a sequence of tokens $W = w_1, w_2, \ldots, w_N$, the perplexity is defined as the exponentiated average negative log-likelihood of the sequence:

$$\text{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i}; \theta)\right)$$

Here, $p(w_i \mid w_{<i}; \theta)$ is the probability assigned by the model (with parameters $\theta$) to the token $w_i$, given the preceding tokens $w_{<i} = w_1, \ldots, w_{i-1}$, and $N$ is the total number of tokens in the sequence.
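As a quick sketch of this definition, the snippet below computes perplexity directly from a handful of made-up per-token probabilities; the values are illustrative, not from any real model.

import torch

# Hypothetical probabilities p(w_i | w_<i) assigned to the actual next tokens
token_probs = torch.tensor([0.25, 0.10, 0.50, 0.05])

# Exponentiated average negative log-likelihood
ppl = torch.exp(-token_probs.log().mean())
print(f"PPL: {ppl.item():.2f}")  # roughly 6.3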
If you've trained language models, you'll recognize the term inside the exponentiation. The average negative log-likelihood is precisely the cross-entropy loss typically minimized during training.
$$\text{CrossEntropyLoss}(W) = -\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i}; \theta)$$

Therefore, perplexity is simply the exponential of the cross-entropy loss calculated over the test set:
$$\text{PPL}(W) = \exp(\text{CrossEntropyLoss}(W))$$

This direct relationship is convenient. If you monitor the cross-entropy loss on a validation set during training, you are effectively monitoring a value whose exponential is the perplexity. A model trained to minimize cross-entropy loss is implicitly being trained to minimize perplexity.
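As a small illustration of this relationship, converting a logged per-token cross-entropy loss to perplexity is a one-liner; the loss value below is made up.

import math

validation_loss = 2.31  # hypothetical average cross-entropy per token, in nats
perplexity = math.exp(validation_loss)
print(f"Perplexity: {perplexity:.2f}")  # ~10.07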
To compute perplexity for your trained LLM, you need a representative held-out test set – data the model has not seen during training. The process generally follows these steps:

1. Tokenize the test data with the same tokenizer used during training.
2. Run the model over the test sequences to obtain logits for each position.
3. For each position, compute the log probability the model assigned to the actual next token, typically by applying log_softmax to the output and gathering the relevant log probability.
4. Average these negative log-likelihoods over all evaluated tokens and exponentiate the result.

Let's illustrate with a simplified PyTorch example. Assume model is your trained language model, test_loader provides batches of token IDs from the test set, and loss_fn is typically torch.nn.CrossEntropyLoss (which combines LogSoftmax and NLLLoss).
import torch
import math

# Assume model is your trained LLM, and test_loader yields batches of input_ids
# Example loss function (adjust reduction if calculating manually)
# Using reduction='mean' directly gives the average loss per token if batches
# are handled correctly.
loss_fn = torch.nn.CrossEntropyLoss(
    ignore_index=model.config.pad_token_id
)  # Ignore padding

model.eval()  # Set model to evaluation mode
total_loss = 0.0
total_tokens = 0

with torch.no_grad():  # Disable gradient calculations for inference
    for batch in test_loader:
        # Assuming batch is a dictionary like
        # {'input_ids': tensor, 'attention_mask': tensor}
        # Prepare inputs and labels for causal LM loss calculation
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)

        # Labels for next-token prediction: the logits for input_ids[..., i]
        # predict input_ids[..., i + 1]
        labels = input_ids.clone()
        # For CrossEntropyLoss, tokens outside the main sequence should be
        # ignored. This is often handled by setting labels to ignore_index
        # wherever attention_mask is 0, or, as here, by passing the padding
        # token id as ignore_index and shifting logits and labels relative
        # to each other.

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        logits = outputs.logits

        # Shift logits and labels so that tokens < n predict token n
        # Logits shape: (batch_size, sequence_length, vocab_size)
        # Labels shape: (batch_size, sequence_length)
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        # Calculate loss for this batch
        # Flatten the tokens for CrossEntropyLoss
        loss = loss_fn(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1)
        )

        # Count non-ignored tokens (adjust based on your padding/masking
        # strategy)
        num_tokens_in_batch = (
            shift_labels != loss_fn.ignore_index
        ).sum().item()

        # Accumulate loss, weighting by the number of non-padding tokens
        # evaluated in this batch
        total_loss += loss.item() * num_tokens_in_batch
        total_tokens += num_tokens_in_batch

if total_tokens > 0:
    average_loss = total_loss / total_tokens
    perplexity = math.exp(average_loss)
    print(f"Test Set Cross-Entropy Loss: {average_loss:.4f}")
    print(f"Test Set Perplexity: {perplexity:.4f}")
else:
    print("No tokens were evaluated.")
The code snippet demonstrates calculating perplexity using PyTorch's CrossEntropyLoss. It processes batches, calculates loss only on non-padded tokens, aggregates the total loss, and computes the final perplexity by exponentiating the average loss per token.
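If your model happens to be a Hugging Face transformers causal language model, a common alternative (sketched below under that assumption) is to pass labels directly and let the model compute the shifted cross-entropy loss internally; label positions set to -100 are excluded from the loss.

import math
import torch

# A sketch assuming model is a Hugging Face causal LM (e.g. AutoModelForCausalLM)
# and test_loader yields dicts with 'input_ids' and 'attention_mask'.
model.eval()
total_loss = 0.0
total_tokens = 0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)

        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # exclude padding from the loss

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # outputs.loss is the mean loss over the non-ignored, shifted tokens
        num_tokens = (labels[..., 1:] != -100).sum().item()
        total_loss += outputs.loss.item() * num_tokens
        total_tokens += num_tokens

perplexity = math.exp(total_loss / total_tokens)
print(f"Perplexity (labels-based variant): {perplexity:.4f}")

Up to how padding is masked, this should match the manual calculation above.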
A lower perplexity value indicates the model's probability distribution over the test set is "sharper" and assigns higher probability to the observed sequences. It means the model is less "perplexed" or "surprised" by the test data, suggesting better language modeling performance according to this intrinsic measure.