After tokenizing your input text into sequences of numerical IDs, the next step is to organize these sequences into batches suitable for feeding into the Transformer model during training. Simply grouping raw token ID sequences isn't enough, primarily because sequences within a batch almost always have different lengths. Neural networks, especially those implemented in frameworks like PyTorch or TensorFlow, typically require inputs in the form of tensors with uniform shapes. This is where padding and attention masks become essential.
Imagine you want to train your model with a batch containing these two tokenized sentences:
Sentence A: [101, 7592, 2077, 2003, 102] (5 tokens)
Sentence B: [101, 2023, 2003, 1037, 4077, 3681, 102] (7 tokens)
You cannot directly stack these into a single rectangular tensor. To resolve this, we use padding. We choose a maximum sequence length for the batch (often the length of the longest sequence in that specific batch) and add special "padding" tokens to the end of shorter sequences until they all reach this maximum length.
Most tokenizers reserve a specific ID for padding, often 0. Using this padding token (let's assume its ID is 0) and padding to the length of Sentence B (7 tokens), our batch would look like this:
[101, 7592, 2077, 2003, 102, 0, 0]
[101, 2023, 2003, 1037, 4077, 3681, 102]
Now, these two sequences can be combined into a single tensor of shape (batch_size, sequence_length), which in this case is (2, 7).
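To make this concrete, here is a minimal sketch in PyTorch (reusing the token IDs and pad ID of 0 from the example above) showing one way to pad the two sequences by hand and stack them into a tensor:

import torch

sentence_a = [101, 7592, 2077, 2003, 102]              # 5 tokens
sentence_b = [101, 2023, 2003, 1037, 4077, 3681, 102]  # 7 tokens
pad_id = 0

sequences = [sentence_a, sentence_b]
max_len = max(len(seq) for seq in sequences)

# Append the padding ID until every sequence reaches the batch maximum length
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

input_ids = torch.tensor(padded)
print(input_ids.shape)  # torch.Size([2, 7])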
Padding solves the structural problem, but it introduces a computational one. The self-attention mechanism, as discussed in Chapter 2, calculates scores based on interactions between all tokens in the sequence. We don't want the model to pay attention to these artificial padding tokens, as they contain no meaningful information and attending to them could negatively impact performance.
This is where the attention mask comes in. It's a binary tensor with the same shape as the input ID tensor (or dimensions compatible for broadcasting during the attention calculation). The mask indicates which tokens the model should attend to and which it should ignore.
A common convention is to use:
1 for real tokens (attend).
0 for padding tokens (ignore).
For our example batch, the corresponding attention mask would be:
[1, 1, 1, 1, 1, 0, 0]
[1, 1, 1, 1, 1, 1, 1]
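In code, the attention mask can be derived directly from the padded IDs by comparing them against the padding ID. A small sketch, assuming the same padded tensor and pad ID of 0 as above:

import torch

pad_id = 0
input_ids = torch.tensor([
    [101, 7592, 2077, 2003, 102, 0, 0],
    [101, 2023, 2003, 1037, 4077, 3681, 102],
])

# 1 where the token is real (attend), 0 where it is padding (ignore)
attention_mask = (input_ids != pad_id).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1]])

This comparison works because no real token in the batch shares the padding ID; tokenizers that generate masks for you handle this bookkeeping consistently.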
These masks are typically applied within the attention mechanism before the softmax step. By adding a large negative number (like -10000 or negative infinity) to the attention scores corresponding to padding positions (where the mask is 0), the subsequent softmax operation effectively assigns near-zero probability to these positions. This ensures padding tokens don't contribute to the context vectors computed by the attention heads.
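To illustrate the mechanics, here is a rough sketch of masked scaled dot-product attention for a single head. The function name and the -1e9 constant are illustrative choices, not a specific library API:

import math
import torch
import torch.nn.functional as F

def masked_attention(query, key, value, attention_mask):
    # query, key, value: (batch_size, seq_len, d_k); attention_mask: (batch_size, seq_len)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Block padded key positions by pushing their scores toward negative infinity
    scores = scores.masked_fill(attention_mask[:, None, :] == 0, -1e9)

    weights = F.softmax(scores, dim=-1)  # padding positions receive ~0 probability
    return torch.matmul(weights, value)

# Usage with random tensors and the mask from the running example
q = k = v = torch.randn(2, 7, 16)
mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1]])
print(masked_attention(q, k, v, mask).shape)  # torch.Size([2, 7, 16])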
Putting it together, a typical input batch for training a standard sequence-to-sequence Transformer (like for machine translation) often contains several components, usually organized in a dictionary or similar structure:
input_ids: The token IDs for the source sequence (e.g., the sentence in the original language), padded to a uniform length within the batch. Shape: (batch_size, source_sequence_length).
attention_mask: The attention mask for the source sequence, indicating padding tokens. Shape: (batch_size, source_sequence_length).
decoder_input_ids: The token IDs for the target sequence (e.g., the translated sentence), also padded. Critically, for training, this is usually the target sequence shifted right (often starting with a special start-of-sequence token) and truncated. This provides the input to the decoder at each step. Shape: (batch_size, target_sequence_length).
decoder_attention_mask: The attention mask for the target sequence. This mask serves a dual purpose: it masks out padding tokens and implements the look-ahead mask required for causal self-attention in the decoder. The look-ahead aspect prevents the decoder from attending to future tokens during training, ensuring it only uses previous tokens to predict the next one. Shape: (batch_size, target_sequence_length).
labels: The actual target sequence IDs the model should predict, used for calculating the loss function. This is often the decoder_input_ids shifted left, without the initial start-of-sequence token, and potentially masking out padding tokens in the loss calculation itself. Shape: (batch_size, target_sequence_length).

Here's a simplified text representation of how these components might look for a single training example (batch size 1) for a translation task "Hello world" -> "Bonjour le monde", assuming simplified tokenization, padding to length 5, and specific token IDs:
Source: "Hello world" -> Tokens: [101, 87, 99, 102]
Target: "Bonjour le monde" -> Tokens: [201, 150, 160, 170, 202]
Padding Token ID: 0
Max Length: 5
input_ids: [101, 87, 99, 102, 0]
attention_mask: [ 1, 1, 1, 1, 0]
# Decoder inputs are typically shifted right (start token 201)
decoder_input_ids: [201, 150, 160, 170, 202] # Assume target was exactly length 5 here
# Decoder mask needs padding AND look-ahead
# (Simplified: just showing padding mask here)
decoder_attention_mask: [ 1, 1, 1, 1, 1]
# Labels are the target tokens the model should predict
labels: [150, 160, 170, 202, 0] # Padded labels for loss calculation
Simplified batch components for a single example. Note that decoder_attention_mask in a real implementation would also incorporate the causal (look-ahead) masking. The labels are often the decoder_input_ids shifted left, with padding.
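The shift relationships above can be checked with a short sketch that assembles the same single-example batch programmatically. The token IDs, the pad ID of 0, and the padding length of 5 come from the example; the pad helper is purely illustrative:

import torch

pad_id = 0
max_len = 5

source_ids = [101, 87, 99, 102]          # "Hello world"
target_ids = [201, 150, 160, 170, 202]   # "Bonjour le monde" (201 = start, 202 = end)

def pad(seq, length, value=pad_id):
    return seq + [value] * (length - len(seq))

batch = {
    "input_ids": torch.tensor([pad(source_ids, max_len)]),
    "attention_mask": torch.tensor([pad([1] * len(source_ids), max_len, 0)]),
    "decoder_input_ids": torch.tensor([pad(target_ids, max_len)]),
    "decoder_attention_mask": torch.tensor([pad([1] * len(target_ids), max_len, 0)]),
    # Labels are the decoder inputs shifted left by one position, then padded
    "labels": torch.tensor([pad(target_ids[1:], max_len)]),
}

for name, tensor in batch.items():
    print(name, tensor.tolist())
# input_ids [[101, 87, 99, 102, 0]]
# attention_mask [[1, 1, 1, 1, 0]]
# decoder_input_ids [[201, 150, 160, 170, 202]]
# decoder_attention_mask [[1, 1, 1, 1, 1]]
# labels [[150, 160, 170, 202, 0]]

# The causal (look-ahead) part of the decoder mask can be built from a
# lower-triangular matrix and combined with the padding mask:
causal = torch.tril(torch.ones(max_len, max_len, dtype=torch.long))
decoder_mask = causal * batch["decoder_attention_mask"][0]  # shape (max_len, max_len)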
Modern deep learning libraries provide helpful tools for managing this process. For instance, tokenizers from libraries like Hugging Face transformers often automatically generate attention masks when encoding text. Furthermore, data loading utilities (DataLoader in PyTorch, tf.data in TensorFlow) often include collating functions that can dynamically pad sequences within each batch to the maximum length required for that specific batch, which is more efficient than padding all sequences to a fixed global maximum length.
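As one illustration (assuming the transformers library and PyTorch are installed; the model name and sentences here are arbitrary), a Hugging Face tokenizer returns the attention mask automatically, and a custom collate function can pad each batch only to that batch's own maximum length:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer pads to the longest sequence in this call and builds the mask
encoded = tokenizer(["Hello world", "This is a longer example sentence"],
                    padding=True, return_tensors="pt")
print(encoded["input_ids"].shape, encoded["attention_mask"].shape)

# Dynamic per-batch padding inside a DataLoader collate function
def collate_fn(batch_of_id_lists):
    tensors = [torch.tensor(ids) for ids in batch_of_id_lists]
    input_ids = pad_sequence(tensors, batch_first=True,
                             padding_value=tokenizer.pad_token_id)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask}

dataset = [[101, 7592, 2077, 2003, 102],
           [101, 2023, 2003, 1037, 4077, 3681, 102]]
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
print(next(iter(loader))["input_ids"].shape)  # torch.Size([2, 7])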
Understanding how these batches are constructed, including the role of padding and attention masks, is fundamental for correctly preparing data and training Transformer models effectively. With properly formatted batches, we can now move on to defining the loss function that will guide the model's learning process.