After converting your sequences of tokens (like words or characters) into sequences of integers, you'll likely encounter a practical challenge: the sequences rarely have the same length. For example, sentences in a text dataset naturally vary in word count, and different time series segments might cover different durations.
Consider these integer-encoded sentences:
[12, 45, 6, 887, 3]        (Length 5)
[12, 101, 500]             (Length 3)
[99, 2, 76, 1024, 50, 1]   (Length 6)
While Recurrent Neural Networks are designed to handle sequential data, the underlying operations in deep learning frameworks usually rely on processing data in batches represented as uniform tensors. Stacking these variable-length sequences directly into a single tensor is not straightforward, because tensors require consistent dimensions. Feeding a list of lists with different lengths directly into most RNN layers will typically result in an error.
This is where padding comes in. Padding is the standard technique used to enforce a uniform length across all sequences within a batch. It works by adding a special, reserved value (almost always 0) to the shorter sequences until they all match a designated length.
The core idea is simple: augment shorter sequences with placeholder values. Typically, the value 0 is reserved for padding. This is convenient because vocabulary mappings often start indexing actual tokens from 1. If your vocabulary encoding uses 0 for a real token, you'll need to choose a different padding value or shift your vocabulary indices.
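For instance, here is a small sketch of a vocabulary mapping that reserves index 0 for padding; the word list and the word_to_index name are made up purely for illustration:

# Hypothetical vocabulary: real tokens are indexed from 1,
# leaving 0 free to serve as the padding value.
words = ["the", "cat", "sat", "mat"]
word_to_index = {word, i + 1 for i, word in enumerate(words)} if False else {word: i + 1 for i, word in enumerate(words)}
PAD_INDEX = 0

print(word_to_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'mat': 4}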
Let's revisit our example sequences. If we decide to pad them all to the length of the longest sequence in the batch (which is 6), they become:
[12, 45, 6, 887, 3, 0]
[12, 101, 500, 0, 0, 0]
[99, 2, 76, 1024, 50, 1]   (No padding needed)
Now, these three sequences can be stacked neatly into a single tensor of shape (3, 6), where 3 is the batch size and 6 is the uniform sequence length (or number of time steps).
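As a quick check, the padded sequences can now be stacked with NumPy; this is a minimal sketch using the padded values shown above:

import numpy as np

# The three sequences after padding to length 6
padded = [
    [12, 45, 6, 887, 3, 0],
    [12, 101, 500, 0, 0, 0],
    [99, 2, 76, 1024, 50, 1],
]

batch = np.array(padded)   # stacking works because every row has length 6
print(batch.shape)         # (3, 6): batch size 3, sequence length 6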
You have two primary options for where to add the padding values:
Pre-padding: Add the padding values at the beginning of the sequence.
[0, 12, 45, 6, 887, 3]
[0, 0, 0, 12, 101, 500]
Post-padding: Add the padding values at the end of the sequence (as shown in the first example).
[12, 45, 6, 887, 3, 0]
[12, 101, 500, 0, 0, 0]
Does the choice matter? Sometimes. Since RNNs process sequences step-by-step, the final hidden state is influenced more strongly by later elements in the sequence. With post-padding, the actual content ends before the padding starts, and the final hidden state reflects the end of the original sequence content. With pre-padding, the RNN processes potentially many padding steps before seeing the actual data.
In practice, especially when using mechanisms like masking (which we'll discuss next), the choice often has a minimal impact on final performance for many tasks. Post-padding is perhaps slightly more intuitive, but pre-padding is the default in some libraries (like Keras' pad_sequences function) and can occasionally be beneficial depending on the specific architecture and task. Consistency within your project is generally more important than the absolute choice between pre- and post-padding.
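To make the two options concrete, here is a minimal, framework-free sketch; the pad_to helper is hypothetical, written only for illustration:

def pad_to(seq, length, value=0, mode="post"):
    """Pad a single integer sequence to `length` with `value`, pre or post."""
    padding = [value] * (length - len(seq))
    return seq + padding if mode == "post" else padding + seq

print(pad_to([12, 101, 500], 6, mode="post"))  # [12, 101, 500, 0, 0, 0]
print(pad_to([12, 101, 500], 6, mode="pre"))   # [0, 0, 0, 12, 101, 500]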
How long should the padded sequences be? There are two common strategies:
Padding to the longest sequence in the batch: Pad each batch of sequences up to the length of the longest sequence in that batch, as in the example above. This keeps the amount of padding small, but the number of time steps can differ from one batch to the next.
Padding to a fixed maximum length (maxlen): Define a fixed maximum length (maxlen) for all sequences across the entire dataset. Any sequence longer than maxlen is truncated (either from the beginning or end), and any sequence shorter is padded (pre or post) up to maxlen. This guarantees a consistent input tensor shape for all batches, which simplifies model building. The downside is potential inefficiency if maxlen is much larger than the typical sequence length, leading to tensors with lots of padding, or information loss if many sequences are truncated. Choosing an appropriate maxlen often involves analyzing the distribution of sequence lengths in your dataset.
Deep learning frameworks provide convenient functions to handle padding. For instance, TensorFlow/Keras offers the tf.keras.preprocessing.sequence.pad_sequences utility, which takes a list of sequences (as Python lists of integers) and performs padding according to specified options such as maxlen, padding ('pre' or 'post'), and truncating ('pre' or 'post'). PyTorch doesn't have a single identical function, but padding can be implemented with tensor operations or with utilities in its data loading and batching mechanisms (like pad_sequence in torch.nn.utils.rnn).
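As an illustration, here is a sketch of how the Keras utility might be used; it assumes TensorFlow is installed, and the argument values are examples rather than a definitive recipe:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[12, 45, 6, 887, 3], [12, 101, 500], [99, 2, 76, 1024, 50, 1]]

# Pad (and, if needed, truncate) every sequence to 6 time steps, adding zeros at the end
padded = pad_sequences(sequences, maxlen=6, padding="post", truncating="post")
print(padded.shape)  # (3, 6)

And a corresponding sketch for PyTorch, where pad_sequence pads a list of tensors up to the longest one in the list (assuming PyTorch is installed):

import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([12, 45, 6, 887, 3]),
        torch.tensor([12, 101, 500]),
        torch.tensor([99, 2, 76, 1024, 50, 1])]

# Post-pads with 0 up to the longest sequence; batch_first gives shape (batch, time)
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([3, 6])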
Padding solves the technical problem of creating uniform tensors, but it introduces artificial 0 values into our data. We don't want the RNN to treat these padding zeros as meaningful inputs, as they don't represent actual information from the original sequence. Processing these zeros could negatively impact the learned representations and the final output.
Therefore, after padding, it's essential to let the model know which parts of the input tensor correspond to real data and which parts are just padding. This is achieved through masking, which we will cover in the next section. Masking effectively tells the subsequent layers (like RNN or attention layers) to ignore the padded time steps during their computations.