When we pad sequences to ensure uniform batch dimensions, we introduce artificial tokens into our data. A neural network processes matrices blindly. If we feed a padded sequence directly into a transformer, the self-attention mechanism will treat padding tokens as meaningful parts of the text. To prevent the model from assigning mathematical importance to empty space, we use attention masks.
An attention mask is a binary tensor that perfectly matches the dimensions of the input IDs. It acts as a filter for the attention mechanism. In this tensor, a value of 1 indicates a real token that the model should attend to, and a value of 0 indicates a padding token that the model should completely ignore.
Mapping of input sequence tokens to their corresponding binary attention mask values.
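This mapping can be sketched directly in plain Python. The token ids and the `PAD_ID` value below are hypothetical, chosen only to illustrate the rule: every real token maps to 1, every padding token maps to 0.

```python
# A minimal sketch: build attention masks for a padded batch by hand.
# PAD_ID and the token ids are hypothetical values for illustration only.
PAD_ID = 0

# Two tokenized sequences, padded to a shared length of 6.
padded_batch = [
    [101, 2054, 2003, 2986, 102, PAD_ID],     # five real tokens, one pad
    [101, 2460, 102, PAD_ID, PAD_ID, PAD_ID]  # three real tokens, three pads
]

# 1 marks a real token the model should attend to, 0 marks padding.
attention_mask = [
    [0 if tok == PAD_ID else 1 for tok in seq]
    for seq in padded_batch
]

print(attention_mask)
# [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
```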
To understand why the attention mask is necessary, we must look at the mathematical operations happening inside the transformer blocks. The self-attention mechanism calculates scores to determine how much focus to place on other parts of the input sequence. This calculation uses Query matrices ($Q$) and Key matrices ($K$).
The formula for scaled dot-product attention, with the additive mask included, is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
In this equation, $M$ represents the attention mask. Inside the model architecture, the binary 0s and 1s from our dataset are converted into a different format for this calculation. The positions with a 1 (real tokens) are assigned a value of $0$ in the matrix $M$. The positions with a 0 (padding tokens) are assigned a very large negative number, practically $-\infty$.
When the dot products of the queries and keys are calculated, the matrix $M$ is added to the results. For the padded positions, the score becomes infinitely negative. The softmax function is then applied to turn these raw scores into probabilities.
The standard softmax function is:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
Because $e^{-\infty}$ approaches 0, the attention weights for any padding tokens become exactly zero. The attention mask forces the model to distribute 100% of its attention across the actual sequence, effectively deleting the padding tokens from the self-attention calculation.
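The effect is easy to verify numerically. The sketch below uses hypothetical raw scores for one query over four key positions, with the last position treated as padding; adding $-\infty$ before the softmax drives that position's weight to exactly zero.

```python
import math

# A minimal sketch of the additive mask: scores are hypothetical values.
scores = [2.0, 1.0, 0.5, 3.0]
mask = [0.0, 0.0, 0.0, float("-inf")]  # 0 for real tokens, -inf for padding

masked = [s + m for s, m in zip(scores, mask)]

# Softmax over the masked scores: math.exp(-inf) is exactly 0.0,
# so the padded position receives zero attention weight.
exps = [math.exp(x) for x in masked]
total = sum(exps)
weights = [e / total for e in exps]

print(weights)  # the last weight is exactly 0.0; the rest sum to 1.0
```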
Modern natural language processing libraries abstract away the complexity of manual mask creation. When you initialize a tokenizer and pass it a batch of texts, it automatically generates both the tokenized integers and the corresponding attention mask.
Here is an example of generating an attention mask using the Hugging Face tokenizers:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token

batch_sentences = [
    "What is fine-tuning?",
    "A short sentence."
]

encoded_inputs = tokenizer(
    batch_sentences,
    padding="max_length",
    max_length=8,
    return_tensors="pt"
)

print(encoded_inputs["attention_mask"])
```
The output of this operation will be a PyTorch tensor containing the binary masks for the batch. The first, longer sentence will have ones across its active tokens, while the shorter sentence will display trailing zeros where padding was applied to reach the maximum length of eight.
There is a second masking operation required during supervised fine-tuning. We must mask the target labels. While the attention mask prevents the input from attending to padding, the model still outputs a prediction for every single token position in the sequence. This includes outputting predictions for the padding positions.
If we include padding tokens in the loss calculation, the model will waste capacity learning to predict padding tokens. This degrades training efficiency and can pollute the final model weights.
In PyTorch, the standard cross-entropy loss function is designed to ignore any target label with a specific value. By default, this index is set to -100. We must create a copy of our labels and replace all padding token indices with -100 before passing the batch to the training loop.
```python
labels = encoded_inputs["input_ids"].clone()

# Find all positions where the input is a padding token
padding_condition = labels == tokenizer.pad_token_id

# Replace those positions with -100 in the labels tensor
labels[padding_condition] = -100
```
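The ignore-index semantics can be sketched without PyTorch. The per-position log-probabilities below are hypothetical values; the point is that positions labeled -100 contribute nothing to the averaged loss.

```python
# A minimal sketch of ignore_index=-100 semantics with hypothetical values.
IGNORE_INDEX = -100

# Log-probability of the correct token at each position (hypothetical),
# and the labels after padding positions were replaced with -100.
log_probs = [-0.2, -0.5, -0.1, -3.0, -3.0]
labels = [7, 12, 4, IGNORE_INDEX, IGNORE_INDEX]

# Average the negative log-likelihood over non-ignored positions only.
kept = [-lp for lp, lab in zip(log_probs, labels) if lab != IGNORE_INDEX]
loss = sum(kept) / len(kept)

print(round(loss, 4))
# 0.2667 -- the two padded positions are excluded from the average
```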
By applying an attention mask to the input forward pass and applying a -100 mask to the target labels, we ensure the model learns strictly from the actual text content in the dataset. The padded data serves only to satisfy the dimensional requirements of matrix operations on the GPU.