Machine learning models, including the neural networks we'll focus on, operate on numbers, not raw text or abstract time points. As we discussed, sequential data carries meaning in its order and structure. The challenge, then, is to convert these sequences into numerical formats that models can process while preserving, as much as possible, the inherent sequential information. This process is a fundamental step in building sequence models.
Let's look at common strategies for representing different types of sequential data numerically.
Text is perhaps the most common type of sequential data encountered in natural language processing (NLP). Sentences are sequences of words, and documents are sequences of sentences. Here are the primary ways to turn text into numbers:
The most straightforward method is to map each unique word (or sometimes character) in your dataset to a specific integer.
["The", "cat", "sat", "."]
.<UNK>
for unknown words (encountered during testing but not seen in training) and <PAD>
for padding (which we'll discuss shortly).Example:
Suppose our vocabulary is: {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6, ".": 7}
The sentence "The cat sat." would be encoded as the integer sequence: [2, 3, 4, 7]
.
A sentence like "The dog sat." (if "dog" is not in the vocabulary) might become: [2, 1, 4, 7]
.
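A minimal Python sketch of this lookup (the vocab dictionary and the encode helper are illustrative names, not part of any particular library):

import numpy as np

# Illustrative vocabulary mapping tokens to integer IDs
vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6, ".": 7}

def encode(tokens, vocab):
    # Lowercase to match how the vocabulary was built, and fall back to <UNK>
    # for tokens that were never seen during training
    return [vocab.get(token.lower(), vocab["<UNK>"]) for token in tokens]

print(encode(["The", "cat", "sat", "."], vocab))  # [2, 3, 4, 7]
print(encode(["The", "dog", "sat", "."], vocab))  # [2, 1, 4, 7]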
This method is simple and efficient but has a limitation: the integers themselves imply an arbitrary order and magnitude relationship between words (e.g., is "cat" (3) somehow "less than" "sat" (4)?). This relationship doesn't exist linguistically, and models might incorrectly learn patterns based on these arbitrary integer values.
To avoid the issue of implied order in integer encoding, one-hot encoding can be used. Each token is represented by a vector whose length is equal to the size of the vocabulary. This vector contains all zeros except for a single '1' at the index corresponding to that token in the vocabulary.
Example: Using the vocabulary above (size 8):
"the" -> [0, 0, 1, 0, 0, 0, 0, 0]
"cat" -> [0, 0, 0, 1, 0, 0, 0, 0]
"sat" -> [0, 0, 0, 0, 1, 0, 0, 0]
The sentence "The cat sat." becomes a sequence of these vectors:
[[0,0,1,0,0,0,0,0], [0,0,0,1,0,0,0,0], [0,0,0,0,1,0,0,0], [0,0,0,0,0,0,0,1]]
One-hot vectors for tokens "the", "cat", and "sat". Each vector has a '1' at the token's index and '0' elsewhere.
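A minimal sketch of building these vectors with NumPy, assuming the same vocabulary of size 8:

import numpy as np

vocab_size = 8
sequence = [2, 3, 4, 7]  # integer-encoded "The cat sat."

# One row per token, one column per vocabulary entry
one_hot = np.zeros((len(sequence), vocab_size))
one_hot[np.arange(len(sequence)), sequence] = 1.0
print(one_hot.shape)  # (4, 8)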
One-hot encoding solves the ordering problem of integer encoding, but it introduces two new issues: the vectors become long and sparse because their length equals the vocabulary size, and all vectors are equally distant from one another, so they capture no notion of similarity between related words.
A more advanced and widely used technique involves embedding layers. These layers learn dense, lower-dimensional vectors (called embeddings) for each token. Unlike one-hot vectors, these embeddings capture semantic relationships, meaning similar words tend to have similar vectors. We won't implement embeddings in this chapter, but it's important to know they exist as a powerful alternative to overcome the limitations of integer and one-hot encoding. We will cover them in detail in Chapter 8 when discussing data preparation pipelines.
Time series data, such as stock prices, sensor readings, or weather measurements, is often already numerical. However, some preprocessing is usually still required.
If your time series consists of single measurements at each time step (univariate), like daily closing stock prices, you might use the values directly. If you have multiple measurements at each time step (multivariate), like temperature, humidity, and pressure, you'd have a vector of numbers for each step.
However, just like in standard machine learning, it's generally beneficial to scale numerical features before feeding them into a neural network. Common techniques include normalization (rescaling values to a fixed range such as [0, 1]) and standardization (shifting values to zero mean and unit variance).
Scaling helps gradient descent algorithms converge faster and prevents features with larger values from dominating the learning process.
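Here is a short NumPy sketch of both techniques on an illustrative series; in practice the statistics would be computed on the training split only and reused for validation and test data:

import numpy as np

series = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 15.0, 16.0, 18.0])

# Min-max normalization: rescale values into the range [0, 1]
normalized = (series - series.min()) / (series.max() - series.min())

# Standardization: zero mean and unit variance
standardized = (series - series.mean()) / series.std()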
RNNs are typically trained in a supervised manner. For time series forecasting, this often involves creating input/output pairs using a sliding window approach. You select a window size (number of past time steps) to use as input features and define how many future steps you want to predict as the output label.
Example:
Consider a time series: [10, 12, 11, 13, 14, 15, 16, 18]
Using a window size of 3 to predict the next single step:
Input: [10, 12, 11] -> Output: 13
Input: [12, 11, 13] -> Output: 14
Input: [11, 13, 14] -> Output: 15
Input: [13, 14, 15] -> Output: 16
Input: [14, 15, 16] -> Output: 18
This transforms the raw, unlabeled time series into a supervised learning problem suitable for training sequence models.
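A minimal sketch of this windowing in Python (make_windows is an illustrative helper name):

import numpy as np

def make_windows(series, window_size):
    # Each input is window_size consecutive values; the label is the next value
    inputs, labels = [], []
    for i in range(len(series) - window_size):
        inputs.append(series[i:i + window_size])
        labels.append(series[i + window_size])
    return np.array(inputs), np.array(labels)

X, y = make_windows([10, 12, 11, 13, 14, 15, 16, 18], window_size=3)
print(X.shape, y.shape)  # (5, 3) (5,)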
A practical challenge arises because sequences naturally come in different lengths (e.g., sentences have varying numbers of words). Neural networks, however, typically require inputs to be of a fixed size, especially when processed in batches for efficiency. The standard remedy is padding: shorter sequences are extended with a special padding value (here the integer 0, reserved for the <PAD> token) up to a chosen maximum length, and a binary mask records which positions contain real tokens so the model can ignore the padded ones.
Example: Maximum length = 6. Padding value = 0.
Sequence: [2, 3, 4, 7] -> Padded: [2, 3, 4, 7, 0, 0] -> Mask: [1, 1, 1, 1, 0, 0]
Sequence: [2, 1, 4, 5, 7] -> Padded: [2, 1, 4, 5, 7, 0] -> Mask: [1, 1, 1, 1, 1, 0]
(Here, 1 in the mask indicates a real token, 0 indicates padding.)

Once numerically represented and padded, sequential data is typically arranged into a 3D tensor for input to RNN layers in deep learning frameworks. The dimensions usually follow this convention:
(batch_size, time_steps, features)
batch_size: The number of sequences processed together in one iteration of training.
time_steps: The length of the sequences (after padding). This is the dimension along which the RNN iterates.
features: The number of features representing the input at each time step.
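To make this concrete, here is a sketch that pads the two example sequences from above and stacks them into this 3D shape with NumPy, using a single feature per time step (the integer token ID):

import numpy as np

sequences = [[2, 3, 4, 7], [2, 1, 4, 5, 7]]
max_len = 6  # chosen maximum length

padded = np.zeros((len(sequences), max_len), dtype=np.int64)
mask = np.zeros((len(sequences), max_len), dtype=np.int64)
for row, seq in enumerate(sequences):
    padded[row, :len(seq)] = seq  # copy real tokens; remaining positions stay 0 (<PAD>)
    mask[row, :len(seq)] = 1      # mark real token positions

# Add a trailing feature axis: (batch_size, time_steps, features)
batch = padded[:, :, np.newaxis]
print(batch.shape)  # (2, 6, 1)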
Understanding these numerical representations and data structuring conventions is essential. It bridges the gap between raw sequential data and the mathematical operations performed inside recurrent neural networks. Having established how to prepare our data, we can now begin to explore the specialized network architectures designed to process these numerical sequences effectively, starting with the fundamental concepts of RNNs in the next chapter.