Probability and statistics provide the mathematical language for dealing with uncertainty, which is fundamental to language modeling. At its core, a language model assigns probabilities to sequences of words or tokens. Understanding these concepts helps interpret model predictions, formulate training objectives, and analyze model behavior. This section reviews the essential probabilistic ideas underpinning large language models.
In the context of language models, we are often interested in predicting the next token in a sequence. Given a vocabulary $V$ of possible tokens, the model's prediction for the next token can be represented as a discrete probability distribution over $V$. This means assigning a probability $P(\text{token}=t)$ to each token $t \in V$, such that all probabilities are non-negative and sum to one: $\sum_{t \in V} P(\text{token}=t) = 1$.
The distribution produced by a language model for the next token, given the preceding context, is a categorical distribution: a single draw over the $|V|$ possible outcomes (the single-trial case of the multinomial distribution, which generalizes the binomial to more than two outcomes). The model outputs a vector of probabilities, one for each token in the vocabulary, representing how likely that token is to be the next one in the sequence.
For example, if our vocabulary is simply {cat, dog, sat, mat} and the preceding context is "the cat", the model might output probabilities like P(sat) = 0.70, P(mat) = 0.15, P(dog) = 0.10, and P(cat) = 0.05, favoring tokens that plausibly continue the context. These probabilities must sum to 1.0.
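As a minimal sketch (the numbers above are invented purely for illustration), such a next-token distribution can be represented and checked directly in Python:
import math

# Hypothetical next-token distribution for the context "the cat"
# over the toy vocabulary {cat, dog, sat, mat}; the values are made up.
next_token_probs = {"sat": 0.70, "mat": 0.15, "dog": 0.10, "cat": 0.05}

# Every probability must be non-negative...
assert all(p >= 0 for p in next_token_probs.values())
# ...and together they must sum to one (within floating-point tolerance).
assert math.isclose(sum(next_token_probs.values()), 1.0)

print(next_token_probs)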
A primary goal of language modeling is to estimate the probability of an entire sequence of tokens, $P(w_1, w_2, \dots, w_n)$. The chain rule of probability allows us to decompose this joint probability into a product of conditional probabilities:

$$P(w_1, w_2, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \dots P(w_n \mid w_1, \dots, w_{n-1})$$

This can be written more compactly as:

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

where $P(w_1 \mid w_0)$ is typically simplified to $P(w_1)$. Large language models, particularly autoregressive ones like the GPT family, are trained to approximate these conditional probabilities $P(w_i \mid w_1, \dots, w_{i-1})$. The model takes the preceding sequence (the context) as input and outputs a probability distribution over the vocabulary for the next token $w_i$.
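As a concrete numerical example of the chain rule (the per-step probabilities below are made up for illustration), the following sketch multiplies hypothetical conditional probabilities and shows the equivalent sum of log-probabilities, which is how sequence likelihoods are computed in practice to avoid numerical underflow:
import math

# Hypothetical conditional probabilities P(w_i | w_1, ..., w_{i-1})
# for a six-token sequence; the values are invented for this example.
step_probs = [0.20, 0.05, 0.70, 0.30, 0.25, 0.60]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(step_probs)

# Equivalent, numerically safer form:
# log P(w_1, ..., w_n) = sum_i log P(w_i | w_1, ..., w_{i-1})
log_prob = sum(math.log(p) for p in step_probs)

print(f"Joint probability: {joint_prob:.6f}")
print(f"Log probability:   {log_prob:.4f}")
assert math.isclose(math.exp(log_prob), joint_prob)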
The final layer of most transformer-based language models is a linear transformation followed by a softmax function. The softmax function converts a vector of raw scores (logits) $z = (z_1, z_2, \dots, z_{|V|})$ into a probability distribution $p = (p_1, p_2, \dots, p_{|V|})$:

$$p_j = \text{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{|V|} e^{z_k}}$$

Each $p_j$ represents the model's estimated probability that the $j$-th token in the vocabulary is the next token.
import torch
import torch.nn.functional as F
# Example logits output by a model for a vocabulary of size 5
logits = torch.tensor([1.0, -0.5, 3.0, 0.0, 1.5])
# Apply softmax to get probabilities
probabilities = F.softmax(logits, dim=0)
print(f"Logits: {logits}")
print(f"Probabilities: {probabilities}")
print(f"Sum of probabilities: {probabilities.sum()}")
# Output:
# Logits: tensor([ 1.0000, -0.5000, 3.0000, 0.0000, 1.5000])
# Probabilities: tensor([0.1083, 0.0242, 0.7985, 0.0398, 0.1291])
# Sum of probabilities: 1.0
Information theory provides tools to quantify information and compare probability distributions, which are useful for understanding and training language models.
Entropy, denoted $H(P)$ for a discrete probability distribution P over a set X, measures the average amount of uncertainty or "surprise" associated with the outcomes. It is calculated as:

$$H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$$

The unit is typically bits when using $\log_2$. A distribution with high entropy is more uncertain (e.g., a uniform distribution), while a distribution with low entropy is more peaked (predictable). For a language model, lower entropy over the next-token prediction implies higher confidence.
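For instance (using two made-up distributions), a short calculation shows how entropy separates an uncertain, uniform prediction from a confident, peaked one:
import math

def entropy(p):
    # H(P) = -sum_x P(x) log2 P(x), measured in bits
    return -sum(px * math.log2(px) for px in p if px > 0)

# Two made-up next-token distributions over four outcomes
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain
peaked = [0.85, 0.05, 0.05, 0.05]    # fairly confident

print(f"Entropy of uniform distribution: {entropy(uniform):.3f} bits")  # 2.000
print(f"Entropy of peaked distribution:  {entropy(peaked):.3f} bits")   # ~0.848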
Cross-entropy, denoted $H(P, Q)$, measures the average number of bits needed to identify an event drawn from distribution P when using an optimal code designed for a different distribution Q:

$$H(P, Q) = -\sum_{x \in X} P(x) \log_2 Q(x)$$

In machine learning, cross-entropy is commonly used as a loss function. P represents the "true" distribution (often a one-hot vector where the correct token has probability 1 and all others 0), and Q represents the model's predicted probability distribution. Minimizing cross-entropy loss during training pushes the model's predicted distribution Q closer to the true distribution P. For a single target label y (represented as a one-hot vector) and model predictions q, the cross-entropy loss simplifies to $-\log_2 q_y$, where $q_y$ is the probability the model assigned to the correct label. In practice, deep learning frameworks compute this loss with the natural logarithm, measuring it in nats rather than bits; the choice of base only scales the loss by a constant factor.
import torch
import torch.nn as nn
import torch.nn.functional as F
# Example: Model predictions (logits) and true target label
# Batch size = 1, Vocabulary size = 5
logits = torch.tensor([[1.0, -0.5, 3.0, 0.0, 1.5]]) # Logits for one instance
target = torch.tensor([2]) # True label index is 2 (the token with logit 3.0)
# PyTorch's CrossEntropyLoss combines LogSoftmax and NLLLoss
# It expects raw logits as input and class indices as target
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(f"Logits: {logits}")
print(f"Target index: {target}")
print(f"Cross-Entropy Loss: {loss.item()}")
# Manual calculation:
probabilities = F.softmax(logits, dim=1)
# Loss = -log(probability of correct class)
manual_loss = -torch.log(probabilities[0, target.item()])
print(f"Manual Calculation: -log(P(target)) = {manual_loss.item()}")
# Output:
# Logits: tensor([[ 1.0000, -0.5000,  3.0000,  0.0000,  1.5000]])
# Target index: tensor([2])
# Cross-Entropy Loss: 0.3636
# Manual Calculation: -log(P(target)) = 0.3636
Kullback-Leibler (KL) divergence, denoted $D_{KL}(P \parallel Q)$, measures the difference between two probability distributions P and Q. It quantifies how much information is lost when approximating P with Q:

$$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$

KL divergence is always non-negative ($D_{KL}(P \parallel Q) \geq 0$) and is zero only if $P = Q$. It is not symmetric, meaning $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ in general. Minimizing the KL divergence between the true distribution P and the model's prediction Q is equivalent to minimizing the cross-entropy $H(P, Q)$ when the true distribution P (and thus its entropy $H(P)$) is fixed, which is usually the case in supervised learning.
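As a small numerical check (the two distributions below are made up for illustration), the following sketch computes the KL divergence directly and confirms the identity $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$:
import math

# Two made-up distributions over a four-token vocabulary
P = [0.70, 0.15, 0.10, 0.05]   # "true" distribution
Q = [0.40, 0.30, 0.20, 0.10]   # model's approximation

def entropy(p):
    # H(P) = -sum_x P(x) log2 P(x)
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log2 Q(x)
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log2 (P(x) / Q(x))
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

print(f"H(P)       = {entropy(P):.4f} bits")
print(f"H(P, Q)    = {cross_entropy(P, Q):.4f} bits")
print(f"D_KL(P||Q) = {kl_divergence(P, Q):.4f} bits")

# The identity D_KL(P || Q) = H(P, Q) - H(P) holds:
assert math.isclose(kl_divergence(P, Q), cross_entropy(P, Q) - entropy(P))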
Once a model produces a probability distribution over the vocabulary, we often need to sample from this distribution, especially for tasks like text generation.
The simplest method is to sample directly according to the predicted probabilities. Tokens with higher probabilities are more likely to be chosen.
To control the randomness of the sampling process, temperature scaling is often applied to the logits z before the softmax:

$$p_j = \frac{e^{z_j / T}}{\sum_{k=1}^{|V|} e^{z_k / T}}$$

Here, T is the temperature parameter. Values of T below 1 sharpen the distribution, concentrating probability on the most likely tokens and making generation more deterministic; T = 1 leaves the standard softmax unchanged; values of T above 1 flatten the distribution, increasing randomness and diversity.
import torch
import torch.nn.functional as F
logits = torch.tensor([1.0, -0.5, 3.0, 0.0, 1.5]) # Same logits as before
temperatures = [0.5, 1.0, 2.0]
distributions = {}  # store each temperature's distribution for later comparison

print(f"Original Logits: {logits}\n")

for T in temperatures:
    scaled_logits = logits / T
    probabilities = F.softmax(scaled_logits, dim=0)
    print(f"Temperature = {T}")
    print(f"Scaled Logits: {scaled_logits.numpy().round(2)}")
    print(f"Probabilities: {probabilities.numpy().round(3)}")
    # Sample tokens based on the probabilities;
    # we use multinomial and draw several samples for illustration
    samples = torch.multinomial(
        probabilities,
        num_samples=10,
        replacement=True
    )
    sampled_list = samples.tolist()
    print(f"Sampled token indices (10 samples): {sampled_list}\n")
    distributions[T] = probabilities.numpy()
# Example output (the probabilities are deterministic, but the sampled
# indices are random and will differ between runs):
# Original Logits: tensor([ 1.0000, -0.5000,  3.0000,  0.0000,  1.5000])
#
# Temperature = 0.5
# Scaled Logits: [ 2. -1.  6.  0.  3.]
# Probabilities: [0.017 0.001 0.933 0.002 0.046]
# Sampled token indices (10 samples): [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
#   (very likely to pick index 2)
#
# Temperature = 1.0
# Scaled Logits: [ 1.  -0.5  3.   0.   1.5]
# Probabilities: [0.094 0.021 0.695 0.035 0.155]
# Sampled token indices (10 samples): [2, 2, 2, 0, 4, 2, 2, 2, 2, 4]
#   (mostly index 2, some others possible)
#
# Temperature = 2.0
# Scaled Logits: [ 0.5  -0.25  1.5   0.    0.75]
# Probabilities: [0.164 0.078 0.447 0.1   0.211]
# Sampled token indices (10 samples): [2, 2, 0, 4, 0, 2, 3, 2, 2, 0]
#   (more diversity)
Probability distributions over a vocabulary of 5 tokens, calculated using different temperature values (T) applied to the same initial logits. Lower temperatures concentrate probability mass on the most likely token (index 2), while higher temperatures yield a more uniform distribution.
Besides temperature, other techniques like Top-k sampling (sampling only from the k most likely tokens) and Top-p (nucleus) sampling (sampling from the smallest set of tokens whose cumulative probability exceeds a threshold p) are commonly used to balance quality and diversity in generated text.
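The sketch below shows one common way to implement these two truncation strategies on top of the same logits used earlier. The helper functions top_k_sample and top_p_sample are illustrative, not a library API, and exact cutoff conventions vary between implementations:
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, -0.5, 3.0, 0.0, 1.5])

def top_k_sample(logits, k):
    # Keep only the k highest-scoring tokens, renormalize, then sample.
    top_values, top_indices = torch.topk(logits, k)
    probs = F.softmax(top_values, dim=0)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

def top_p_sample(logits, p):
    # Nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability exceeds p, renormalize, then sample.
    probs = F.softmax(logits, dim=0)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    cutoff = int((cumulative < p).sum().item()) + 1  # number of tokens kept
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return sorted_indices[choice].item()

print(f"Top-k sample (k=2): token index {top_k_sample(logits, k=2)}")
print(f"Top-p sample (p=0.9): token index {top_p_sample(logits, p=0.9)}")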
A solid understanding of these probability and statistics fundamentals is indispensable when working with large language models. It allows you to reason about model predictions, understand the training process via loss functions like cross-entropy, and control text generation using sampling techniques.