Let's consolidate the text preprocessing steps we've discussed into a reusable pipeline. A structured process lets you apply the same transformations consistently to training, validation, and new prediction data. We'll focus on a pipeline for text data that transforms raw sentences into padded integer sequences suitable for an embedding layer followed by recurrent layers like LSTMs or GRUs.
Our objective is to create a function or class that takes a list of raw text documents (e.g., sentences or paragraphs) and outputs a numerical tensor where each document is represented as a sequence of integers, padded to a uniform length.
We'll primarily use tools available within common deep learning frameworks. For instance, TensorFlow/Keras provides the tensorflow.keras.preprocessing.text.Tokenizer class and the tensorflow.keras.preprocessing.sequence.pad_sequences function, which handle most of the heavy lifting. If you're using PyTorch, libraries like torchtext offer similar functionality. In this example, we'll illustrate the pipeline using the TensorFlow/Keras utilities.
The core steps involved in our text preparation pipeline are:
1. Tokenization: fit a tokenizer on the training texts to build a word-to-integer vocabulary.
2. Sequencing: convert each document into a list of integer indices using that vocabulary.
3. Padding/truncation: bring every sequence to a uniform length so documents can be batched together.
Here's the conceptual flow: the text data preparation pipeline transforms raw text into a fixed-size numerical tensor suitable for RNN input.
Let's walk through the implementation using TensorFlow/Keras.
First, let's define some sample text data we want to process.
# Sample corpus of text documents
corpus = [
    "Recurrent networks process sequences.",
    "LSTMs handle long dependencies.",
    "Padding makes sequences uniform.",
    "Embeddings represent words numerically."
]
# New data to be processed later using the *same* fitted pipeline
new_data = [
    "GRUs are another type of recurrent network.",
    "Uniform sequence length is needed for batching."
]
We initialize a Tokenizer and fit it on our corpus. The num_words argument limits the vocabulary to the most frequent words. Setting oov_token ensures that words encountered later, which were not in the fitting corpus, are assigned a special "out-of-vocabulary" token.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Configuration
vocab_size = 20 # Max number of words to keep
oov_tok = "<OOV>" # Token for out-of-vocabulary words
# Initialize and fit the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(corpus)
# Display the learned word index (vocabulary)
word_index = tokenizer.word_index
print("Word Index:")
# Display a subset for brevity
print({k: word_index[k] for k in list(word_index)[:15]})
This fitting process builds the internal vocabulary: word_index contains the mapping from words to integers. Note that num_words controls the vocabulary size used during encoding, not the size of word_index itself; the tokenizer reserves index 0 (conventionally the padding value) and assigns the OOV token index 1.
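To make this distinction concrete, the short check below (illustrative only, reusing the corpus and oov_tok defined above) fits a second tokenizer with a deliberately tiny num_words so the clipping behaviour is visible:
# Illustrative check: word_index records every word seen during fitting,
# while encoding is clipped to num_words; words whose index falls outside
# that range are replaced by the OOV index (1).
tiny_tokenizer = Tokenizer(num_words=5, oov_token=oov_tok)
tiny_tokenizer.fit_on_texts(corpus)

print("Full word_index size:", len(tiny_tokenizer.word_index))
print("Encoded with num_words=5:",
      tiny_tokenizer.texts_to_sequences(["Recurrent networks process sequences."]))
Only the most frequent words keep their own indices during encoding; the rest collapse to the OOV index.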
Now, we use the fitted tokenizer to convert the text sentences into sequences of integers.
# Convert corpus to integer sequences
sequences = tokenizer.texts_to_sequences(corpus)
print("\nOriginal Corpus Integer Sequences:")
for seq in sequences:
    print(seq)
Each list of integers corresponds to a sentence in the corpus, using the indices learned in the previous step.
The sequences currently have different lengths, so we use pad_sequences to make them uniform. maxlen defines the target length; padding='post' adds padding (0s) at the end, while truncating='post' removes elements from the end if a sequence is too long. Pre-padding ('pre') is also a common choice, especially for LSTMs/GRUs, because it keeps the actual tokens at the end of the sequence, closest to the final hidden state.
# Padding configuration
max_sequence_len = 10 # Define a maximum length
padding_type = 'post'
truncation_type = 'post'
# Pad the sequences
padded_sequences = pad_sequences(sequences,
                                 maxlen=max_sequence_len,
                                 padding=padding_type,
                                 truncating=truncation_type)
print("\nPadded Corpus Sequences (Tensor Shape: {}):".format(padded_sequences.shape))
print(padded_sequences)
The output padded_sequences is now a NumPy array with shape (number_of_samples, max_sequence_len). This is the format expected by an embedding layer in your RNN model; index 0 is the default padding value.
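For comparison, the same sequences padded with padding='pre' (a quick sketch reusing the objects defined above) put the zeros at the front, which is the layout many LSTM/GRU setups prefer:
# Same sequences, padded at the front: zeros first, real tokens at the end.
pre_padded = pad_sequences(sequences,
                           maxlen=max_sequence_len,
                           padding='pre',
                           truncating=truncation_type)

print("Post-padded first row:", padded_sequences[0])
print("Pre-padded first row: ", pre_padded[0])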
A significant advantage of this pipeline is applying the same transformations to new, unseen data using the already fitted tokenizer.
# Convert new data to integer sequences using the *same* tokenizer
new_sequences = tokenizer.texts_to_sequences(new_data)
print("\nNew Data Integer Sequences:")
for seq in new_sequences:
    print(seq)  # Notice the OOV token (index 1) for words not in the original fit
# Pad the new sequences using the *same* parameters
new_padded_sequences = pad_sequences(new_sequences,
                                     maxlen=max_sequence_len,
                                     padding=padding_type,
                                     truncating=truncation_type)
print("\nPadded New Data Sequences (Tensor Shape: {}):".format(new_padded_sequences.shape))
print(new_padded_sequences)
Notice how words like "grus" or "batching", which were not in the original corpus, are mapped to the oov_token index (1). This ensures unseen words are processed consistently.
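You can confirm this by looking the words up in the fitted vocabulary (a quick check using the word_index built earlier):
# Unseen words are absent from word_index, so texts_to_sequences maps
# them to the OOV index (1); known words return their learned index.
for word in ["grus", "batching", "recurrent"]:
    print(word, "->", word_index.get(word, "not in vocabulary -> OOV"))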
For better reusability, you can wrap these steps into a function or a class. Here's a functional approach:
def create_text_pipeline(training_texts, max_vocab_size=10000, oov_token="<OOV>"):
    """Fits a tokenizer on training texts."""
    tokenizer = Tokenizer(num_words=max_vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(training_texts)
    return tokenizer

def preprocess_texts(texts, tokenizer, max_len, padding='post', truncating='post'):
    """Applies tokenization and padding using a fitted tokenizer."""
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences, maxlen=max_len, padding=padding, truncating=truncating)
    return padded
# --- Usage Example ---
# 1. Create pipeline (fit tokenizer) on training data
train_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
pipeline_tokenizer = create_text_pipeline(train_corpus, max_vocab_size=100)
pipeline_maxlen = 10 # Set based on analysis or requirement
# 2. Process training data
train_padded = preprocess_texts(train_corpus, pipeline_tokenizer, pipeline_maxlen)
print("\n--- Pipeline Usage ---")
print("Processed Training Data Shape:", train_padded.shape)
# print(train_padded) # Optionally print the array
# 3. Process new/validation/test data
validation_data = ["This is another document to process."]
validation_padded = preprocess_texts(validation_data, pipeline_tokenizer, pipeline_maxlen)
print("Processed Validation Data Shape:", validation_padded.shape)
print(validation_padded)
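In practice, pipeline_maxlen is usually derived from the data rather than hard-coded. One possible heuristic (a sketch, not the only approach) is to inspect the length distribution of the tokenized training sequences and cover a high percentile so that little information is truncated:
# Inspect the token-length distribution of the training data to choose maxlen.
train_sequences = pipeline_tokenizer.texts_to_sequences(train_corpus)
lengths = [len(seq) for seq in train_sequences]

print("Length stats (min/mean/max):", min(lengths), np.mean(lengths), max(lengths))
print("Suggested maxlen (95th percentile):", int(np.percentile(lengths, 95)))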
The padded_sequences tensor is now ready. The typical next step in a model is an Embedding layer, which converts each integer in the sequence into a dense vector representation. A few practical points to keep in mind:
Masking: if you use the default padding value 0, configure the Embedding layer with mask_zero=True (in Keras) or ensure subsequent RNN layers handle masking appropriately, so the model learns to ignore the padded steps (a minimal sketch follows at the end of this section).
Vocabulary and length: choosing vocab_size and max_sequence_len involves trade-offs. Larger values capture more information but increase model size, memory usage, and computation. Analyze your data (e.g., the distribution of sequence lengths) to make informed choices.
Data loading: utilities such as tf.data or PyTorch's DataLoader help integrate preprocessing steps like these efficiently into your data loading workflow.
This practical pipeline provides a standard and repeatable way to prepare your text data, forming a fundamental step before training sophisticated sequence models like LSTMs and GRUs.
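To make the masking point concrete, here is a minimal model sketch that consumes the padded tensor with an Embedding layer configured to ignore the padding index 0. The layer sizes are arbitrary placeholders, and the final Dense layer assumes a hypothetical binary label.
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Minimal sketch: Embedding (masking padding index 0) -> LSTM -> Dense.
# The sizes here are illustrative placeholders, not tuned values.
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=16, mask_zero=True),
    LSTM(32),
    Dense(1, activation='sigmoid')  # hypothetical binary output
])

# Run the padded corpus through the model to confirm the shapes line up.
outputs = model(padded_sequences)
print(outputs.shape)  # (number_of_samples, 1)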