Recurrent models operate on numerical data, yet text arrives as strings of characters. The initial, fundamental step in processing text for sequence models is to break it down into manageable units, called tokens, and then create a consistent way to represent these tokens numerically. This process involves tokenization and vocabulary building.
Tokenization is the process of segmenting a piece of text into smaller units, or tokens. Think of it like chopping a sentence into words or characters. The choice of tokenization strategy significantly impacts how the model "sees" the text and can affect performance.
Common strategies include:
Word Tokenization: This is perhaps the most intuitive approach. Text is split into words, often based on whitespace and punctuation. For example, the sentence "RNNs process sequences." might be tokenized into ["RNNs", "process", "sequences", "."]. Considerations here include how to handle punctuation (keep it, discard it, split it off?), case sensitivity (convert everything to lowercase?), and contractions (split "don't" into "do" and "n't"?). For many natural language processing tasks, word-level tokenization provides a good balance between sequence length and meaning representation.
Character Tokenization: Here, the text is split into individual characters. The same sentence "RNNs process sequences." becomes ['R', 'N', 'N', 's', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', ' ', 's', 'e', 'q', 'u', 'e', 'n', 'c', 'e', 's', '.']. This results in a much smaller vocabulary (just letters, numbers, punctuation, and whitespace) and naturally handles misspellings or rare words. However, it leads to much longer sequences, potentially making it harder for models like simple RNNs to capture long-distance dependencies.
Subword Tokenization: This hybrid approach is used by many modern NLP models (such as BERT and GPT). Algorithms like Byte Pair Encoding (BPE) or WordPiece break words into smaller, meaningful sub-units. For instance, "tokenization" might become ["token", "ization"]. This allows the model to handle rare words by composing them from known subwords, keeps the vocabulary size manageable, and avoids the excessively long sequences produced by character tokenization. While powerful, implementing these schemes usually involves specialized libraries.
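To make the idea concrete, the sketch below performs a single BPE merge step on a toy corpus in plain Python. The corpus, the "</w>" end-of-word marker, and the helper names are illustrative assumptions rather than a production implementation; in practice you would use a library such as Hugging Face tokenizers or SentencePiece.

from collections import Counter

# Toy corpus: each word is a tuple of symbols plus an end-of-word marker,
# mapped to how often the word appears (illustrative counts).
corpus = {
    ("t", "o", "k", "e", "n", "</w>"): 5,
    ("t", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n", "</w>"): 3,
}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

best = most_frequent_pair(corpus)   # e.g. ('t', 'o') on this toy corpus
corpus = merge_pair(corpus, best)   # 'to' is now a single subword symbol

Repeating this merge step many times produces the subword vocabulary: frequent words end up as single tokens, while rare words decompose into smaller known pieces.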
For most foundational work with RNNs, word tokenization is a common starting point.
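As a concrete illustration, here is a minimal sketch of word and character tokenization in plain Python. The regular expression is an assumption made for this example; real projects often rely on a library tokenizer (for example NLTK or spaCy) that handles punctuation, contractions, and Unicode more carefully.

import re

sentence = "RNNs process sequences."

# Word tokenization: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)
# ['RNNs', 'process', 'sequences', '.']

# Character tokenization: every character, including whitespace, becomes a token.
char_tokens = list(sentence)
# ['R', 'N', 'N', 's', ' ', 'p', 'r', 'o', ...]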
Once you have tokenized your text corpus (your entire training dataset), the next step is to build a vocabulary. The vocabulary is the definitive set of all unique tokens encountered in the training data. We then create a mapping, typically a dictionary or hash map, that assigns a unique integer index to each unique token.
Why integers? Neural networks perform mathematical operations, and they need numerical inputs. This mapping translates our symbolic tokens into numbers the network can understand.
Consider this small dataset of two sentences: "the cat sat" and "the dog sat".
Tokenize:
["the", "cat", "sat"]
["the", "dog", "sat"]
Collect Unique Tokens: The unique tokens are {"the", "cat", "sat", "dog"}.
Build Vocabulary Mapping: Assign an integer ID to each unique token. A common practice is to reserve index 0 for padding (discussed later) and sometimes index 1 for unknown words.
Vocabulary:
{
"<PAD>": 0, // Reserved for padding
"<UNK>": 1, // Reserved for unknown words
"the": 2,
"cat": 3,
"sat": 4,
"dog": 5
}
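Building this mapping in code is straightforward. The sketch below assumes the tokenized corpus from the example above and reserves the special indices first; the variable names are illustrative, not a fixed API.

# Tokenized training corpus from the example above.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Reserve indices for the special tokens so they always receive the same IDs.
vocab = {"<PAD>": 0, "<UNK>": 1}

for sentence in corpus:
    for token in sentence:
        if token not in vocab:
            vocab[token] = len(vocab)

# vocab is now {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'dog': 5}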
We often include special tokens in the vocabulary:
<PAD> (Padding Token): Used to make sequences in a batch have the same length. We'll cover this in the "Padding Sequences" section.
<UNK> or <OOV> (Unknown/Out-of-Vocabulary Token): Represents tokens encountered during testing or inference that were not present in the training data vocabulary. This is important because real-world data often contains words the model hasn't seen before. Mapping them to a dedicated <UNK> token allows the model to process them, albeit without specific learned knowledge about that exact word.
<SOS> / <EOS> (Start/End of Sequence Tokens): Sometimes used, especially in sequence generation tasks, to signal the beginning or end of a sequence.
The size of the vocabulary is a design choice. A larger vocabulary captures more words but increases the model's embedding layer size (more parameters to learn). Often, very infrequent words (appearing only once or twice in a large corpus) are mapped to <UNK> to keep the vocabulary size reasonable and potentially improve generalization.
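One common way to apply such a cutoff is to count token frequencies and only admit tokens that appear at least a minimum number of times; everything else falls back to <UNK> at lookup time. The threshold below is an arbitrary value chosen for illustration.

from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
min_count = 2  # arbitrary cutoff for illustration

counts = Counter(token for sentence in corpus for token in sentence)

vocab = {"<PAD>": 0, "<UNK>": 1}
for token, count in counts.items():
    if count >= min_count:
        vocab[token] = len(vocab)

# "cat" and "dog" appear only once, so they are excluded from the vocabulary
# and map to <UNK> when looked up.
unk_id = vocab["<UNK>"]
ids = [vocab.get(token, unk_id) for token in ["the", "cat", "sat"]]
# vocab == {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'sat': 3}
# ids == [2, 1, 3]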
A simplified illustration of mapping tokens (including special ones like <PAD> and <UNK>) to unique integer IDs in a vocabulary.
Tokenization and vocabulary creation are the essential first steps in transforming raw text into a format suitable for the next stage: converting sequences of tokens into sequences of integers using this vocabulary map, a process known as integer encoding.