Recurrent Neural Networks, as we've discussed, process information sequentially, maintaining an internal state or memory. This design makes them well-suited for tasks involving sequences like text analysis or time series forecasting. However, just like other neural networks, RNNs operate on numerical data, specifically tensors. Raw sequence data, whether it's a sentence full of words or a list of temperature readings over time, needs to be transformed into this numerical format before an RNN can process it. This preparation step is essential for successful model training.
Let's look at the common techniques for preparing two primary types of sequence data: text and time series.
Text data requires several steps to convert sentences or documents into numerical sequences suitable for RNN layers like SimpleRNN, LSTM, or GRU.
The first step is tokenization, which involves breaking down the text into smaller units called tokens. These tokens are typically words, but they can also be characters or sub-word units depending on the task and desired granularity.
For example, the sentence "Keras makes deep learning easy." could be tokenized into words: ["Keras", "makes", "deep", "learning", "easy", "."].
Keras provides utilities like the Tokenizer class found in keras.preprocessing.text (or the keras.layers.TextVectorization layer in newer Keras versions integrated with TensorFlow/PyTorch backends) to handle this process efficiently.
from keras.preprocessing.text import Tokenizer
sentences = [
"Keras makes deep learning easy.",
"RNNs are great for sequences."
]
# Initialize the Tokenizer
# num_words limits the vocabulary size to the most frequent words
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary based on the sentences
tokenizer.fit_on_texts(sentences)
# The tokenizer now has a word_index dictionary
print(tokenizer.word_index)
# Output (example): {'keras': 1, 'makes': 2, 'deep': 3, 'learning': 4, 'easy': 5,
# 'rnns': 6, 'are': 7, 'great': 8, 'for': 9, 'sequences': 10}
Once tokenized, we need to build a vocabulary: a unique mapping from each token (word) to an integer index. The Tokenizer handles this automatically during the fit_on_texts step. The word_index attribute stores this mapping.
A common practice is to reserve index 0 for padding (discussed next) and potentially another index for "out-of-vocabulary" (OOV) tokens. These are words encountered during processing that were not present in the training text used to build the vocabulary.
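For instance, the Tokenizer accepts an oov_token argument that reserves an index for unseen words; the exact indices below depend on the fitted vocabulary and are only illustrative.
# Reserve a dedicated token for out-of-vocabulary words
tokenizer_oov = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer_oov.fit_on_texts(sentences)
print(tokenizer_oov.word_index)
# Output (example): {'<OOV>': 1, 'keras': 2, 'makes': 3, 'deep': 4, 'learning': 5, 'easy': 6, ...}
# Any word not in this vocabulary is later mapped to index 1 instead of being dropped.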
After fitting the tokenizer, you can convert your text sequences into sequences of integers using the texts_to_sequences method.
# Convert sentences to sequences of integers
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# Output (based on previous example): [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
# Note: Punctuation might be handled differently based on tokenizer settings.
RNNs, especially when sequences are processed in batches for efficiency, require the input sequences within a batch to have a uniform length. However, real-world sentences or documents rarely have the same number of words. We address this using padding.
Padding involves adding a special padding token (usually represented by the integer 0) to shorter sequences until they reach a specified maximum length (maxlen). You can add padding at the beginning (pre) or at the end (post) of each sequence. Post-padding is generally preferred because it keeps the real tokens at the start of the sequence; to stop the trailing padding values from influencing the RNN's state, it is commonly paired with masking (for example, setting mask_zero=True on an Embedding layer). Keras provides the pad_sequences utility for this.
# Note: in newer Keras versions this utility is also exposed as keras.utils.pad_sequences
from keras.preprocessing.sequence import pad_sequences
# Assume 'sequences' is the list of integer sequences from the previous step
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
# Define a maximum sequence length (can be determined from data or set manually)
maxlen = 8
# Pad the sequences
padded_sequences = pad_sequences(sequences, maxlen=maxlen, padding='post')
print(padded_sequences)
# Output:
# [[ 1 2 3 4 5 0 0 0]
# [ 6 7 8 9 10 0 0 0]]
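As noted earlier, newer Keras versions provide the TextVectorization layer, which can combine tokenization, vocabulary building, and padding in a single step. A minimal sketch, assuming a TensorFlow-backed setup (the resulting integer indices depend on the adapted vocabulary):
import keras

# Builds a vocabulary and converts text to padded integer sequences in one step
vectorizer = keras.layers.TextVectorization(
    max_tokens=100,             # vocabulary size limit
    output_sequence_length=8,   # pad or truncate every sequence to length 8
)
vectorizer.adapt(sentences)     # learn the vocabulary from the corpus
int_sequences = vectorizer(sentences)
print(int_sequences.shape)      # (2, 8), already padded with zeros at the end
The rest of this section continues with the Tokenizer plus pad_sequences output shown above.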
These padded integer sequences are now almost ready. Typically, the next step within the model architecture itself is to use an Embedding layer. This layer takes these integer sequences as input and transforms each integer index into a dense vector of a fixed size (the embedding dimension). These dense vectors capture semantic relationships between words and are learned during model training. We won't detail the Embedding layer here, but it's important to know that these padded integer sequences are the direct input to such a layer in most NLP models using RNNs.
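To show where these padded sequences go, here is a minimal sketch of a model front end; the vocabulary size of 100, embedding dimension of 16, and 32 LSTM units are arbitrary choices for illustration.
import keras

model = keras.Sequential([
    # Maps each integer index (0-99) to a learned 16-dimensional vector;
    # mask_zero=True tells downstream layers to ignore the padding index 0
    keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True),
    keras.layers.LSTM(32),
    keras.layers.Dense(1, activation="sigmoid"),
])

# padded_sequences from above has shape (num_sentences, maxlen) and feeds straight in
outputs = model(padded_sequences)  # shape: (num_sentences, 1)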
Time series data, such as stock prices, sensor readings, or weather measurements over time, requires a different preparation approach.
Neural networks generally perform better when input numerical data features are scaled to a standard range. Large fluctuations or differing scales between features can hinder the learning process. Common techniques include Min-Max scaling, which rescales each feature to a fixed range such as [0, 1], and standardization, which rescales each feature to zero mean and unit variance.
You should fit the scaler (e.g., MinMaxScaler or StandardScaler from Scikit-learn) only on the training data, and then use the fitted scaler to transform both the training and validation/test data. This prevents information leakage from the validation/test sets into the training process.
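A short sketch of this fit-on-train, transform-everything pattern with Scikit-learn; the random arrays here are placeholders for a real series.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 100 training readings and 20 test readings, one feature each
train_series = np.random.rand(100, 1) * 50
test_series = np.random.rand(20, 1) * 50

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_series)  # learn min/max from training data only
test_scaled = scaler.transform(test_series)        # reuse the same min/max on the test data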
RNNs learn from sequences. For a time series, we need to restructure the continuous data into fixed-length input sequences (windows) and corresponding target values.
Imagine a time series [x1, x2, x3, x4, x5, x6, ...]. We can create supervised learning examples by using a window of past values to predict a future value. If we choose a window size (or lookback period) of 3 timesteps to predict the value at the next timestep, we generate pairs like:
- input [x1, x2, x3] -> target x4
- input [x2, x3, x4] -> target x5
- input [x3, x4, x5] -> target x6
This process essentially converts the time series forecasting problem into a supervised learning problem where the model learns to map a sequence of past observations to a future observation.
Creating input sequences (windows) and corresponding targets from a time series using a lookback window of size 3.
You can implement this windowing logic using simple Python loops or more efficiently with libraries like NumPy or Pandas.
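Here is one way to implement the windowing with NumPy; the helper name make_windows is just for illustration, and the window size of 3 matches the example above.
import numpy as np

def make_windows(series, window_size):
    # Slide a window over the series: each window of past values predicts the next value
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])  # lookback window
        y.append(series[i + window_size])      # value to predict
    return np.array(X), np.array(y)

series = np.array([10, 20, 30, 40, 50, 60])
X, y = make_windows(series, window_size=3)
print(X.shape, y.shape)  # (3, 3) (3,)
print(X[0], y[0])        # [10 20 30] 40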
Once prepared (tokenized/encoded/padded for text, or normalized/windowed for time series), the data needs to be shaped correctly for Keras RNN layers. These layers typically expect input data in a 3D tensor format:
(batch_size, timesteps, features)
Let's break this down:
- batch_size: The number of sequences processed together in one pass during training or inference. This dimension is often handled implicitly by Keras when using methods like fit.
- timesteps: The length of each sequence. For text, this is the maxlen used during padding; for a time series, it is the length of the lookback window.
- features: The number of features representing the input at each timestep. For text fed into an Embedding layer, this is usually 1 (the integer index itself); the Embedding layer then expands it into the embedding dimension. For a time series, it is the number of variables recorded at each timestep.
For example:
- A batch of 32 text sequences, each padded to 50 tokens, would have shape (32, 50, 1). (Though often, the last dimension is omitted, and Keras infers it or the Embedding layer handles the (32, 50) shape directly.)
- A batch of 64 time series windows, each spanning 10 timesteps with 3 features per timestep, would have shape (64, 10, 3), as illustrated in the sketch at the end of this section.
Understanding this expected input shape is important when building your Keras models and ensuring your data pipeline feeds the RNN layers correctly. Properly preparing your sequence data through tokenization, encoding, padding, normalization, and windowing lays the foundation for training effective recurrent neural networks.
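To tie the shape discussion together, here is a minimal sketch that feeds placeholder data with the (64, 10, 3) shape from the example above into a small RNN model; the layer sizes are arbitrary.
import numpy as np
import keras

# Placeholder batch: 64 windows, each with 10 timesteps and 3 features per timestep
X_batch = np.random.rand(64, 10, 3).astype("float32")
y_batch = np.random.rand(64, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(10, 3)),   # (timesteps, features); the batch size is implicit
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_batch, y_batch, epochs=2, batch_size=16, verbose=0)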