After breaking down your text into tokens and establishing a consistent vocabulary, the next logical step is to convert these sequences of tokens into sequences of numbers. Machine learning models, including RNNs, operate on numerical data, not strings. Integer encoding is the process of replacing each token in your sequence with its corresponding unique integer ID from the vocabulary you created.
Think of the vocabulary you built in the previous step as a lookup table or a dictionary. Each unique token (word, subword, or character) is a key, and its assigned integer is the value.
For example, suppose we processed the sentence "the quick brown fox" and built the following simple vocabulary:
vocabulary = {'<UNK>': 0, '<PAD>': 1, 'the': 2, 'quick': 3, 'brown': 4, 'fox': 5, 'jumps': 6}
Here, <UNK> represents unknown tokens (those not seen during vocabulary creation), and <PAD> represents padding tokens, which we'll discuss later for handling variable sequence lengths.
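Because the vocabulary is just a Python dictionary, looking up a single token's ID is an ordinary key access. As a quick illustration using the vocabulary above (the token 'cat' is a made-up example that is not in the vocabulary):

# The vocabulary acts as a lookup table: token in, integer ID out
print(vocabulary['the'])    # 2
print(vocabulary['fox'])    # 5
# An unseen token such as 'cat' is not a key here, so a plain
# vocabulary['cat'] would raise a KeyError; we handle this case with <UNK> below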
Returning to the example sentence, the tokenized sequence is:
['the', 'quick', 'brown', 'fox']
To perform integer encoding, we iterate through this token sequence and replace each token with its integer ID from the vocabulary:

'the' → 2
'quick' → 3
'brown' → 4
'fox' → 5
The resulting integer-encoded sequence is:
[2, 3, 4, 5]
This numerical sequence retains the original order of the tokens, which is fundamental for sequence models.
Mapping tokens to integer IDs using the vocabulary.
This transformation is applied systematically to every sequence in your dataset, ensuring all text inputs are converted into a numerical format that models can process.
In Python, this mapping is often done using dictionary lookups. Given a list of tokenized sequences and a vocabulary dictionary, you can convert them like this:
# Sample vocabulary (already built)
vocabulary = {'<UNK>': 0, '<PAD>': 1, 'the': 2, 'quick': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9}

# Sample tokenized sentences
token_sequences = [
    ['the', 'quick', 'brown', 'fox'],
    ['the', 'lazy', 'dog'],
    ['fox', 'jumps', 'over', 'dog']
]

# Perform integer encoding
encoded_sequences = []
for seq in token_sequences:
    encoded_seq = [vocabulary.get(token, vocabulary['<UNK>']) for token in seq]
    encoded_sequences.append(encoded_seq)

# Print the results
print(encoded_sequences)
# Output: [[2, 3, 4, 5], [2, 8, 9], [5, 6, 7, 9]]
Notice the use of vocabulary.get(token, vocabulary['<UNK>']). This is a common pattern: it tries to find the token in the vocabulary. If the token exists, its ID is returned. If the token is not found (it's an "out-of-vocabulary" or OOV token), it defaults to returning the ID assigned to the special <UNK> token (in this case, 0). This ensures that your process doesn't crash when encountering new words during testing or deployment, although how well the model handles many unknowns depends on its training.
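To see this fallback in action, here is a small illustrative check against the vocabulary above (the word 'purple' is a made-up out-of-vocabulary token):

# 'purple' is not in the vocabulary, so it maps to the <UNK> ID (0)
oov_sentence = ['the', 'purple', 'fox']
encoded = [vocabulary.get(token, vocabulary['<UNK>']) for token in oov_sentence]
print(encoded)  # [2, 0, 5]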
This step is essential because mathematical models cannot directly process strings. By converting tokens to integers, we create a numerical representation that can be fed into the subsequent layers of a neural network.
It's important to understand that these integer IDs (2, 3, 4, 5, etc.) are arbitrary identifiers. The model doesn't initially know that the integer 3 ('quick') is semantically closer to 5 ('fox') than it is to 9 ('dog'). The numerical value of the ID itself doesn't carry relational meaning between words. This limitation is addressed by the next step in the pipeline: embedding layers.
These integer sequences form the input that will typically be passed to an embedding layer, which we introduce next. The embedding layer will learn dense vector representations for each integer ID, capturing semantic relationships between tokens during model training.
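As a preview of how these sequences are consumed, the sketch below feeds an integer-encoded batch into an embedding layer. It assumes PyTorch and an arbitrary embedding size of 8; other frameworks follow the same pattern.

import torch
import torch.nn as nn

# One embedding vector per vocabulary entry; embedding_dim=8 is an arbitrary choice for illustration
embedding = nn.Embedding(num_embeddings=len(vocabulary), embedding_dim=8)

# A batch containing one integer-encoded sequence
ids = torch.tensor([[2, 3, 4, 5]])   # shape: (batch_size=1, sequence_length=4)
vectors = embedding(ids)             # shape: (1, 4, 8)
print(vectors.shape)                 # torch.Size([1, 4, 8])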