Text preprocessing steps can be consolidated into a reusable pipeline. A structured process lets you apply the same transformations consistently to training data, validation data, and new data at prediction time. The focus here is a pipeline for text: transforming raw sentences into padded integer sequences suitable for an embedding layer followed by recurrent layers such as LSTMs or GRUs.
Our objective is to create a function or class that takes a list of raw text documents (e.g., sentences or paragraphs) and outputs a numerical tensor where each document is represented as a sequence of integers, padded to a uniform length.
We'll primarily use tools available within common deep learning frameworks. For instance, TensorFlow/Keras provides the tensorflow.keras.preprocessing.text.Tokenizer class and the tensorflow.keras.preprocessing.sequence.pad_sequences function, which handle most of the heavy lifting. If you're using PyTorch, libraries like torchtext offer similar functionalities. In this example, we'll illustrate using the TensorFlow/Keras utilities.
The core steps involved in our text preparation pipeline are:

1. Fit a tokenizer on the training corpus to build a word-to-integer vocabulary.
2. Convert each document into a sequence of integer indices using that vocabulary.
3. Pad or truncate the sequences to a uniform length.

The text data preparation pipeline transforms raw text into a fixed-size numerical tensor suitable for RNN input.
Let's walk through the implementation using TensorFlow/Keras.
First, let's define some sample text data we want to process.
# Sample corpus of text documents
corpus = [
    "Recurrent networks process sequences.",
    "LSTMs handle long dependencies.",
    "Padding makes sequences uniform.",
    "Embeddings represent words numerically."
]
# New data to be processed later using the *same* fitted pipeline
new_data = [
    "GRUs are another type of recurrent network.",
    "Uniform sequence length is needed for batching."
]
We initialize a Tokenizer and fit it on our corpus. The num_words argument limits the vocabulary size to the most frequent words. Setting oov_token ensures that words encountered later, which were not in the initial fitting corpus, are assigned a special "out-of-vocabulary" token.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Configuration
vocab_size = 20 # Max number of words to keep
oov_tok = "<OOV>" # Token for out-of-vocabulary words
# Initialize and fit the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(corpus)
# Display the learned word index (vocabulary)
word_index = tokenizer.word_index
print("Word Index:")
# Display a subset for brevity
print({k: word_index[k] for k in list(word_index)[:15]})
This fitting process builds the internal vocabulary. word_index contains the mapping from words to integers. Notice that num_words controls the vocabulary size used during encoding, not the size of word_index itself. The tokenizer reserves index 0 for padding and assigns the OOV token index 1.
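A quick way to see this (a small check reusing the tokenizer fitted above) is to compare the number of entries in word_index with the num_words cap and to look up the OOV token:

# word_index keeps every word seen during fitting; num_words only caps encoding
print("Words learned during fitting:", len(tokenizer.word_index))
print("Index of the OOV token:", tokenizer.word_index[oov_tok])  # expected: 1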
Now, we use the fitted tokenizer to convert the text sentences into sequences of integers.
# Convert corpus to integer sequences
sequences = tokenizer.texts_to_sequences(corpus)
print("\nOriginal Corpus Integer Sequences:")
for seq in sequences:
    print(seq)
Each list of integers corresponds to a sentence in the corpus, using the indices learned in the previous step.
Sequences currently have different lengths. We use pad_sequences to make them uniform. maxlen defines the target length. padding='post' adds padding (0s) at the end, while truncating='post' removes elements from the end if a sequence is too long. 'pre' padding is also a common choice, particularly for LSTMs/GRUs, because it places the actual tokens at the end of the sequence, closest to the final hidden state.
# Padding configuration
max_sequence_len = 10 # Define a maximum length
padding_type = 'post'
truncation_type = 'post'
# Pad the sequences
padded_sequences = pad_sequences(sequences,
                                 maxlen=max_sequence_len,
                                 padding=padding_type,
                                 truncating=truncation_type)
print("\nPadded Corpus Sequences (Tensor Shape: {}):".format(padded_sequences.shape))
print(padded_sequences)
The output padded_sequences is now a NumPy array with shape (number_of_samples, max_sequence_len). This is the format expected by an embedding layer in your RNN model. Index 0 is the default padding value.
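To see the effect of the two modes side by side, here is a quick check that reuses the sequences from above and pads them with 'pre' instead of 'post' (purely illustrative):

# Compare 'pre' and 'post' padding on the same first sequence
pre_padded = pad_sequences(sequences, maxlen=max_sequence_len,
                           padding='pre', truncating='pre')
print("'pre'  padding:", pre_padded[0])        # zeros come before the tokens
print("'post' padding:", padded_sequences[0])  # zeros come after the tokens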
A significant advantage of this pipeline is applying the same transformations to new, unseen data using the already fitted tokenizer.
# Convert new data to integer sequences using the *same* tokenizer
new_sequences = tokenizer.texts_to_sequences(new_data)
print("\nNew Data Integer Sequences:")
for seq in new_sequences:
    print(seq)  # Notice the OOV token (index 1) for words not in the original fit
# Pad the new sequences using the *same* parameters
new_padded_sequences = pad_sequences(new_sequences,
                                     maxlen=max_sequence_len,
                                     padding=padding_type,
                                     truncating=truncation_type)
print("\nPadded New Data Sequences (Tensor Shape: {}):".format(new_padded_sequences.shape))
print(new_padded_sequences)
Notice how words like "grus" or "batching", which were not in the original corpus, are mapped to the oov_token index (1). This ensures consistent processing.
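If you want to confirm this programmatically, a small check (using the objects defined above) counts how many positions in each new sequence map to the OOV index:

# Count how many tokens in each new sequence were mapped to the OOV index
oov_index = tokenizer.word_index[oov_tok]  # index 1 by construction
for text, seq in zip(new_data, new_sequences):
    print(f"{seq.count(oov_index)} OOV tokens in: {text!r}")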
For better reusability, you can wrap these steps into a function or a class. Here's a functional approach:
def create_text_pipeline(training_texts, max_vocab_size=10000, oov_token="<OOV>"):
    """Fits a tokenizer on training texts."""
    tokenizer = Tokenizer(num_words=max_vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(training_texts)
    return tokenizer

def preprocess_texts(texts, tokenizer, max_len, padding='post', truncating='post'):
    """Applies tokenization and padding using a fitted tokenizer."""
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences, maxlen=max_len, padding=padding, truncating=truncating)
    return padded
# --- Usage Example ---
# 1. Create pipeline (fit tokenizer) on training data
train_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
pipeline_tokenizer = create_text_pipeline(train_corpus, max_vocab_size=100)
pipeline_maxlen = 10 # Set based on analysis or requirement
# 2. Process training data
train_padded = preprocess_texts(train_corpus, pipeline_tokenizer, pipeline_maxlen)
print("\n--- Pipeline Usage ---")
print("Processed Training Data Shape:", train_padded.shape)
# print(train_padded) # Optionally print the array
# 3. Process new/validation/test data
validation_data = ["This is another document to process."]
validation_padded = preprocess_texts(validation_data, pipeline_tokenizer, pipeline_maxlen)
print("Processed Validation Data Shape:", validation_padded.shape)
print(validation_padded)
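If you prefer the class-based form mentioned earlier, a minimal sketch could look like the following. The TextPipeline name and its fit/transform methods are illustrative choices, not part of Keras:

class TextPipeline:
    """Bundles a fitted Tokenizer with fixed padding settings."""

    def __init__(self, max_vocab_size=10000, max_len=10, oov_token="<OOV>",
                 padding='post', truncating='post'):
        self.tokenizer = Tokenizer(num_words=max_vocab_size, oov_token=oov_token)
        self.max_len = max_len
        self.padding = padding
        self.truncating = truncating

    def fit(self, training_texts):
        # Learn the vocabulary from the training texts only
        self.tokenizer.fit_on_texts(training_texts)
        return self

    def transform(self, texts):
        # Apply the already-fitted vocabulary, then pad to a fixed length
        sequences = self.tokenizer.texts_to_sequences(texts)
        return pad_sequences(sequences, maxlen=self.max_len,
                             padding=self.padding, truncating=self.truncating)

# Usage mirrors the functional version above
class_pipeline = TextPipeline(max_vocab_size=100, max_len=pipeline_maxlen).fit(train_corpus)
print("Class-based pipeline output shape:", class_pipeline.transform(validation_data).shape)

Either form works; the important point is that fitting happens once, on training data, and the same fitted object is reused for every later transformation.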
The padded_sequences tensor is now ready. The typical next step in a model is an Embedding layer, which takes these integer sequences and converts each integer into a dense vector representation. If you use the default padding value 0, configure the Embedding layer with mask_zero=True (in Keras) or ensure subsequent RNN layers handle masking appropriately, so the model learns to ignore the padded steps (a short sketch of this hand-off appears at the end of this section).

Choosing vocab_size and max_sequence_len involves trade-offs. Larger values capture more information but increase model size, memory usage, and computation. Analyze your data (e.g., the sequence length distribution) to make informed choices.

Utilities such as tf.data or the PyTorch DataLoader help integrate preprocessing steps like these efficiently into your data loading workflow.

This practical pipeline provides a standard and repeatable way to prepare your text data, forming a fundamental step before training sophisticated sequence models like LSTMs and GRUs.
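As a brief illustration of that hand-off, here is a minimal model sketch. The layer sizes and the single sigmoid output are assumptions chosen for a small binary classification example, not requirements from the pipeline itself:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# The Embedding layer consumes the padded integer sequences; mask_zero=True
# tells downstream layers to ignore the 0-valued padding positions.
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=16, mask_zero=True),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

# Running the padded tensor through the model yields one prediction per document
predictions = model(padded_sequences)
print(predictions.shape)  # (number_of_samples, 1)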