Now that we understand how to prepare data and structure it into batches with appropriate padding and masks, let's look at how we actually measure the performance of our Transformer model during training. The goal of training is to adjust the model's parameters (weights and biases) so that its predictions get closer and closer to the actual target sequences. To guide this process, we need a way to quantify the "error" or "difference" between the predicted output and the true output. This is the role of the loss function (also sometimes called the cost function or objective function).
For most sequence-to-sequence tasks tackled by Transformers, such as machine translation, text summarization, or question answering, the model needs to predict a sequence of tokens (words, subwords, characters). At each step in the output sequence, the model effectively performs a classification task: it tries to predict the correct next token from the entire vocabulary.
Given this classification nature at each time step, the most common loss function used for training Transformers on sequence generation tasks is Cross-Entropy Loss. You might recall Cross-Entropy from classification problems in simpler neural networks. Here, we apply it repeatedly across the sequence.
Let's break down how it works in the context of Transformers:
Model Output: For each position in the output sequence, the Transformer's decoder (specifically, the final linear layer followed by a softmax function) produces a probability distribution over the entire target vocabulary. If our vocabulary has $V$ tokens, the output at position $t$ is a vector of probabilities $P_t = [\hat{y}_{t,1}, \hat{y}_{t,2}, \ldots, \hat{y}_{t,V}]$, where $\hat{y}_{t,i}$ is the model's predicted probability that the token at position $t$ is the $i$-th token in the vocabulary. These probabilities sum to 1 ($\sum_{i=1}^{V} \hat{y}_{t,i} = 1$).
Target Output: The actual target sequence provides the "ground truth". For a given position $t$, we know the correct token. This is often represented implicitly as an integer index corresponding to the correct token's position in the vocabulary, or explicitly as a one-hot encoded vector $Y_t$ of size $V$. This vector has a 1 at the index of the correct token and 0s everywhere else. For example, if the correct token is the 5th word in the vocabulary, $Y_t = [0, 0, 0, 0, 1, 0, \ldots, 0]$.
Calculating Loss: Cross-Entropy Loss measures the difference between the predicted probability distribution $P_t$ and the true distribution (represented by $Y_t$). For a single position $t$, the formula is:
$$H(Y_t, P_t) = -\sum_{i=1}^{V} Y_{t,i} \log(\hat{y}_{t,i})$$

Since $Y_t$ is a one-hot vector, only the term corresponding to the correct token (let's say its index is $c$) is non-zero ($Y_{t,c} = 1$). Therefore, the sum simplifies to:

$$H(Y_t, P_t) = -\log(\hat{y}_{t,c})$$

This means the loss for a single position is simply the negative logarithm of the probability assigned by the model to the correct token.
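To make this concrete, here is a small sketch (PyTorch, with a hypothetical vocabulary size and token index) that computes the loss for a single position both by hand and with the framework's built-in cross-entropy:

```python
import torch
import torch.nn.functional as F

V = 10          # hypothetical vocabulary size
target_idx = 4  # index c of the correct token at this position

logits = torch.randn(V)            # raw scores from the decoder's final linear layer
probs = F.softmax(logits, dim=-1)  # predicted distribution P_t, sums to 1

# Cross-entropy at this position: negative log-probability of the correct token
manual_loss = -torch.log(probs[target_idx])

# The built-in gives the same value (it applies log-softmax to the raw scores internally)
builtin_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))
print(manual_loss.item(), builtin_loss.item())
```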
We calculate this cross-entropy loss for every position in the target sequence (except for padding positions, see below). To get the overall loss for a single sequence, we typically average the loss values across all the valid (non-padded) positions in that sequence.
Finally, since we train using mini-batches, the loss for the entire batch is usually the average of the sequence losses for all sequences within that batch. This final average batch loss is the value that the optimization algorithm (like Adam) tries to minimize by updating the model's parameters through backpropagation.
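As a sketch of how these averages might be assembled by hand (the sizes and the padding index pad_idx here are hypothetical; handling of padding is discussed next):

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, V = 2, 5, 10  # hypothetical sizes
pad_idx = 0                        # assumed index of the padding token

logits = torch.randn(batch_size, seq_len, V)          # decoder outputs for the batch
targets = torch.randint(1, V, (batch_size, seq_len))  # target token indices
targets[1, 3:] = pad_idx                              # pretend the second sequence is padded

# Per-position losses, reshaped back to (batch_size, seq_len)
per_token = F.cross_entropy(
    logits.view(-1, V), targets.view(-1), reduction="none"
).view(batch_size, seq_len)

mask = (targets != pad_idx).float()                         # 1 for real tokens, 0 for padding
seq_loss = (per_token * mask).sum(dim=1) / mask.sum(dim=1)  # average over valid positions
batch_loss = seq_loss.mean()                                # average over sequences in the batch
```

In practice many implementations instead average over all non-padded tokens in the batch at once; the difference is usually small, but it is worth being consistent about which convention you use.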
Remember those padding tokens we added to make sequences in a batch the same length? We don't want the model to be penalized for its predictions at these padded positions. They are just artifacts of the batching process and don't represent real content.
Therefore, when calculating the loss, we need to ensure that the loss contributions from these padded positions are ignored. Most deep learning frameworks provide ways to handle this easily:
The standard loss implementations (such as torch.nn.CrossEntropyLoss in PyTorch or tf.keras.losses.SparseCategoricalCrossentropy in TensorFlow) provide a parameter (e.g., ignore_index) or support masked tensors, letting you specify the index of the padding token so that it is automatically excluded from the loss calculation.
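For instance, a minimal PyTorch sketch using ignore_index might look like this (the padding index value is an assumption):

```python
import torch
import torch.nn as nn

pad_idx = 0  # assumed index of the padding token in the vocabulary
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# logits: (batch_size, seq_len, V); targets: (batch_size, seq_len) of token indices
logits = torch.randn(2, 5, 10)
targets = torch.randint(1, 10, (2, 5))
targets[1, 3:] = pad_idx  # padded positions contribute nothing to the loss

# CrossEntropyLoss expects (N, V) class scores and (N,) target indices
loss = criterion(logits.view(-1, 10), targets.view(-1))
```

With ignore_index, the default reduction averages over all non-ignored tokens in the batch rather than averaging per sequence first; both conventions appear in practice.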
While standard cross-entropy works well, models trained with it can sometimes become overconfident in their predictions, especially on large datasets. They might assign a probability extremely close to 1.0 to the predicted token, which can hurt generalization.
A common regularization technique used alongside cross-entropy loss in Transformers is Label Smoothing. Instead of asking the model to predict the target token with 100% probability (using a hard, one-hot target like $[0, 0, 1, 0]$), label smoothing modifies the target distribution slightly. It assigns a slightly lower probability to the true token (e.g., $1 - \epsilon$) and distributes the remaining small probability mass $\epsilon$ across all other tokens in the vocabulary.
For example, with a smoothing factor $\epsilon = 0.1$, the smoothed target distribution might look more like $[0.0001, 0.0001, 0.9, 0.0001, \ldots]$: the correct token receives $1 - \epsilon = 0.9$, and the remaining $0.1$ is spread across the rest of the vocabulary. This encourages the model to be slightly less certain, which often leads to better generalization and robustness. We won't go into the implementation details here, but it's a useful technique to be aware of when training large Transformer models.
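For reference, recent versions of PyTorch expose this behavior directly through a label_smoothing argument on the built-in loss, so applying it requires only one extra parameter:

```python
import torch.nn as nn

pad_idx = 0  # assumed padding token index, as in the earlier sketches
# 10% of the target probability mass is spread over the non-target tokens
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx, label_smoothing=0.1)
```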
In summary, Cross-Entropy Loss is the workhorse for training Transformers on sequence generation tasks. It measures how well the model predicts the next token at each position, provides a gradient signal for learning, and can be readily adapted to handle padding and incorporate techniques like label smoothing. Understanding how this loss is calculated is important for interpreting the training process and diagnosing potential issues.