When we pad sequences to ensure uniform batch dimensions, we introduce artificial tokens into our data. A neural network processes matrices blindly. If we feed a padded sequence directly into a transformer, the self-attention mechanism will treat padding tokens as meaningful parts of the text. To prevent the model from assigning mathematical importance to empty space, we use attention masks.
An attention mask is a binary tensor that perfectly matches the dimensions of the input IDs. It acts as a filter for the attention mechanism. In this tensor, a value of 1 indicates a real token that the model should attend to, and a value of 0 indicates a padding token that the model should completely ignore.
Mapping of input sequence tokens to their corresponding binary attention mask values.
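This mapping can be sketched directly in plain Python. The token ids and the `PAD_ID` value below are hypothetical, chosen only to illustrate the rule: every real token maps to 1, every padding token maps to 0.

```python
# A minimal sketch: build attention masks for a padded batch by hand.
# PAD_ID and the token ids are hypothetical values for illustration only.
PAD_ID = 0

# Two tokenized sequences, padded to a shared length of 6.
padded_batch = [
    [101, 2054, 2003, 2986, 102, PAD_ID],     # five real tokens, one pad
    [101, 2460, 102, PAD_ID, PAD_ID, PAD_ID]  # three real tokens, three pads
]

# 1 marks a real token the model should attend to, 0 marks padding.
attention_mask = [
    [0 if tok == PAD_ID else 1 for tok in seq]
    for seq in padded_batch
]

print(attention_mask)
# [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
```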
To understand why the attention mask is necessary, we must look at the mathematical operations happening inside the transformer blocks. The self-attention mechanism calculates scores to determine how much focus to place on other parts of the input sequence. This calculation uses Query matrices ($Q$) and Key matrices ($K$).
The formula for scaled dot-product attention, with the additive mask included, is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
In this equation, $M$ represents the attention mask. Inside the model architecture, the binary 0s and 1s from our dataset are converted into a different format for this calculation. The positions with a 1 (real tokens) are assigned a value of $0$ in the matrix $M$. The positions with a 0 (padding tokens) are assigned a very large negative number, practically $-\infty$.
When the dot products of the queries and keys are calculated, the matrix $M$ is added to the results. For the padded positions, the score becomes infinitely negative. The softmax function is then applied to turn these raw scores into probabilities.
The standard softmax function is:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
Because $e^{-\infty}$ approaches 0, the attention weights for any padding tokens become exactly zero. The attention mask forces the model to distribute 100% of its attention across the actual sequence, effectively deleting the padding tokens from the self-attention calculation.
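The effect is easy to verify numerically. The sketch below uses hypothetical raw scores for one query over four key positions, with the last position treated as padding; adding $-\infty$ before the softmax drives that position's weight to exactly zero.

```python
import math

# A minimal sketch of the additive mask: scores are hypothetical values.
scores = [2.0, 1.0, 0.5, 3.0]
mask = [0.0, 0.0, 0.0, float("-inf")]  # 0 for real tokens, -inf for padding

masked = [s + m for s, m in zip(scores, mask)]

# Softmax over the masked scores: math.exp(-inf) is exactly 0.0,
# so the padded position receives zero attention weight.
exps = [math.exp(x) for x in masked]
total = sum(exps)
weights = [e / total for e in exps]

print(weights)  # the last weight is exactly 0.0; the rest sum to 1.0
```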
Modern natural language processing libraries abstract away the complexity of manual mask creation. When you initialize a tokenizer and pass it a batch of texts, it automatically generates both the tokenized integers and the corresponding attention mask.
Here is an example of generating an attention mask using the Hugging Face tokenizers:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token

batch_sentences = [
    "What is fine-tuning?",
    "A short sentence."
]

encoded_inputs = tokenizer(
    batch_sentences,
    padding="max_length",
    max_length=8,
    return_tensors="pt"
)

print(encoded_inputs["attention_mask"])
```
The output of this operation will be a PyTorch tensor containing the binary masks for the batch. The first, longer sentence will have ones across its active tokens, while the shorter sentence will display trailing zeros where padding was applied to reach the maximum length of eight.
There is a second masking operation required during supervised fine-tuning. We must mask the target labels. While the attention mask prevents the input from attending to padding, the model still outputs a prediction for every single token position in the sequence. This includes outputting predictions for the padding positions.
If we include padding tokens in the loss calculation, the model will waste capacity learning to predict padding tokens. This degrades training efficiency and can pollute the final model weights.
In PyTorch, the standard cross-entropy loss function is designed to ignore any target label with a specific value. By default, this index is set to -100. We must create a copy of our labels and replace all padding token indices with -100 before passing the batch to the training loop.
```python
labels = encoded_inputs["input_ids"].clone()

# Find all positions where the input is a padding token
padding_condition = labels == tokenizer.pad_token_id

# Replace those positions with -100 in the labels tensor
labels[padding_condition] = -100
```
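The ignore-index semantics can be sketched without PyTorch. The per-position log-probabilities below are hypothetical values; the point is that positions labeled -100 contribute nothing to the averaged loss.

```python
# A minimal sketch of ignore_index=-100 semantics with hypothetical values.
IGNORE_INDEX = -100

# Log-probability of the correct token at each position (hypothetical),
# and the labels after padding positions were replaced with -100.
log_probs = [-0.2, -0.5, -0.1, -3.0, -3.0]
labels = [7, 12, 4, IGNORE_INDEX, IGNORE_INDEX]

# Average the negative log-likelihood over non-ignored positions only.
kept = [-lp for lp, lab in zip(log_probs, labels) if lab != IGNORE_INDEX]
loss = sum(kept) / len(kept)

print(round(loss, 4))
# 0.2667 -- the two padded positions are excluded from the average
```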
By applying an attention mask to the input forward pass and applying a -100 mask to the target labels, we ensure the model learns strictly from the actual text content in the dataset. The padded data serves only to satisfy the dimensional requirements of matrix operations on the GPU.