After transforming your raw sequences into numerical representations using techniques like tokenization, integer encoding, and potentially embedding lookups, and after ensuring uniformity through padding and masking, the next step is to group these processed sequences into batches. Training deep learning models on batches of data, rather than one sample at a time, is standard practice: it makes much better use of hardware parallelism and produces more stable gradient estimates than single-sample updates.
For recurrent models, the input data is typically expected in a specific 3D tensor format. While the exact order might vary slightly between frameworks (like TensorFlow or PyTorch), the common structure is:
(batch_size, time_steps, num_features)
Let's break this down:

batch_size: The number of sequences included in this particular batch.

time_steps: The length of the sequences in the batch. Crucially, due to padding, this dimension will be equal to the length of the longest sequence within that specific batch. All shorter sequences are padded to this length.

num_features: The dimensionality of the representation at each time step.
If the input consists of integer token IDs (to be looked up by an embedding layer inside the model), num_features will be 1 (the integer ID). If each time step is instead represented by an embedding vector or a set of features, num_features will be the dimension of the embedding vector or the number of input features at each time step.

Imagine you have three sequences after integer encoding and padding (where 0 is the padding value):
Sequence A: [10, 25, 31, 0, 0]
Sequence B: [15, 8, 99, 50, 2]
Sequence C: [5, 12, 18, 77, 0]
If these form a batch (batch_size = 3), the time_steps dimension must accommodate the longest sequence (Sequence B, length 5). The resulting input batch tensor (assuming num_features = 1 for integer IDs) would look like this conceptually:
[
[[10], [25], [31], [ 0], [ 0]], // Sequence A (padded)
[[15], [ 8], [99], [50], [ 2]], // Sequence B
[[ 5], [12], [18], [77], [ 0]] // Sequence C (padded)
]
The shape of this tensor is (3,5,1).
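As a quick check, this batch can be assembled directly with NumPy (the values are taken from the example above; the variable names are only illustrative):

import numpy as np

# The three padded sequences from the example (0 is the padding value).
seq_a = [10, 25, 31, 0, 0]
seq_b = [15, 8, 99, 50, 2]
seq_c = [5, 12, 18, 77, 0]

# Stack into (batch_size, time_steps), then add a trailing feature
# dimension to obtain (batch_size, time_steps, num_features).
batch = np.array([seq_a, seq_b, seq_c])   # shape (3, 5)
batch = np.expand_dims(batch, axis=-1)    # shape (3, 5, 1)
print(batch.shape)                        # (3, 5, 1)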
Remember the masks we generated during the padding step? They become absolutely essential when processing batches. The RNN layers (and subsequent layers or loss functions) need to know which elements in the batch tensor correspond to real data and which are just padding.
Continuing the example above, the corresponding mask tensor would indicate the presence of real data (1) versus padding (0):
[
[1, 1, 1, 0, 0], // Mask for Sequence A
[1, 1, 1, 1, 1], // Mask for Sequence B
[1, 1, 1, 1, 0] // Mask for Sequence C
]
This mask tensor, typically with shape (batch_size, time_steps), is often passed alongside the input batch to the RNN layer or used during the loss calculation. Framework APIs for RNN layers usually provide a mechanism for this (e.g., the mask argument in Keras/TensorFlow, or packed sequences in PyTorch), ensuring that computations at padded time steps are ignored.
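Here is a minimal Keras sketch of this idea. The vocabulary size of 100, embedding dimension of 16, and LSTM size of 32 are arbitrary assumptions, not values from the text; the point is that mask_zero=True builds the boolean mask from the padded integer IDs, and the LSTM uses it to skip padded time steps.

import tensorflow as tf

# Padded batch of integer IDs with shape (3, 5); 0 marks padding.
inputs = tf.constant([[10, 25, 31, 0, 0],
                      [15, 8, 99, 50, 2],
                      [5, 12, 18, 77, 0]])

# mask_zero=True makes the Embedding layer compute a boolean mask
# (True for real tokens, False for padding) from the zero entries.
embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=16,
                                      mask_zero=True)
embedded = embedding(inputs)            # shape (3, 5, 16)
mask = embedding.compute_mask(inputs)   # shape (3, 5), boolean

# Passing the mask tells the LSTM to ignore the padded time steps.
outputs = tf.keras.layers.LSTM(32)(embedded, mask=mask)
print(outputs.shape)                    # (3, 32)

In PyTorch, the analogous effect comes from packing the padded batch together with the true sequence lengths (torch.nn.utils.rnn.pack_padded_sequence) before handing it to the recurrent layer.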
Manually creating these padded batches and masks can be tedious. Thankfully, deep learning frameworks provide high-level utilities to streamline this process:
In TensorFlow, the tf.data.Dataset API is highly efficient. You can create a dataset of your variable-length sequences and then use methods like padded_batch. This method automatically groups elements into batches, determines the maximum length within each batch, pads all sequences in that batch to that length, and can often implicitly handle the masking for downstream layers.

In PyTorch, the torch.utils.data.Dataset and DataLoader classes are used. You typically define a custom Dataset to load individual sequences, and the DataLoader handles batching. To achieve padding within batches, you often provide a custom collate_fn to the DataLoader. This function takes a list of samples (your sequences), pads them (e.g., using torch.nn.utils.rnn.pad_sequence) to form the batch tensor, and potentially generates the mask. Short sketches of both approaches appear after the figure below.

Here's a conceptual illustration of creating a batch from variable-length sequences:
Flow of batching variable-length sequences. Original sequences are padded to the maximum length within the group (5 in this case) before being stacked into a batch tensor. A corresponding mask identifies the original data points.
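To make the TensorFlow route concrete, here is a rough sketch that assumes the unpadded versions of the example sequences and an arbitrary batch size of 3. padded_batch pads each batch to the length of its longest member, using 0 as the default padding value:

import tensorflow as tf

# Variable-length, integer-encoded sequences (not yet padded).
sequences = [[10, 25, 31], [15, 8, 99, 50, 2], [5, 12, 18, 77]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))

# padded_batch groups elements and pads each batch to its own maximum length.
dataset = dataset.padded_batch(batch_size=3)

for batch in dataset:
    print(batch.shape)   # (3, 5)

And a corresponding PyTorch sketch, using a custom collate_fn with pad_sequence to build both the padded batch and its mask (again, the names and sizes are illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Variable-length, integer-encoded sequences (not yet padded).
sequences = [torch.tensor([10, 25, 31]),
             torch.tensor([15, 8, 99, 50, 2]),
             torch.tensor([5, 12, 18, 77])]

def collate_fn(batch):
    # Pad every sequence to the longest length in this batch and
    # derive the mask (True = real token, False = padding).
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    mask = padded != 0
    lengths = torch.tensor([len(seq) for seq in batch])
    return padded, mask, lengths

loader = DataLoader(sequences, batch_size=3, collate_fn=collate_fn)
padded, mask, lengths = next(iter(loader))
print(padded.shape, mask.shape)   # torch.Size([3, 5]) torch.Size([3, 5])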
By correctly batching your padded and masked sequences, you prepare the data in the precise format needed for efficient and effective training of Recurrent Neural Networks using modern deep learning libraries. This step bridges the gap between preprocessed individual sequences and the input requirements of the model training loop.