Having established the architectures of Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), let's examine how these models are practically applied to text-based tasks. Unlike methods such as TF-IDF or Bag-of-Words, which treat text as an unordered collection of terms, sequence models process words one by one, maintaining an internal state or "memory" that captures information from previous steps. This inherent ability to handle ordered data makes them suitable for a variety of NLP problems where context and word order are significant.
Preparing Text for Sequence Models
Before feeding text into an RNN, LSTM, or GRU, it needs to be transformed into a numerical format that the network can understand. This typically involves two main steps:
- Tokenization: The text is broken down into individual tokens (usually words, but sometimes subwords or characters). If pre-trained word embeddings are used, this step should be consistent with how those embeddings were generated.
- Numerical Conversion: Each token is mapped to a numerical representation. One-hot encoding is possible, but it produces very high-dimensional, sparse vectors; a much more common and effective approach is to use word embeddings:
  - First, each unique token in the vocabulary is assigned an integer index.
  - Then, an embedding layer (or a pre-trained embedding lookup) converts each integer index into a dense, lower-dimensional vector (e.g., 100 or 300 dimensions). These vectors, often learned with methods such as Word2Vec or GloVe (as discussed in Chapter 4) or trained jointly with the sequence model, capture semantic relationships between words.
The result is that an input sentence becomes a sequence of dense vectors, where each vector corresponds to a token in the sequence. For example, the sentence "Review this product" might be tokenized into ["Review", "this", "product"], mapped to indices [15, 8, 120], and then converted by the embedding layer into a sequence of three vectors: [vector_15, vector_8, vector_120]. This sequence of vectors is then fed into the sequence model, one vector per time step.
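To make this pipeline concrete, the short sketch below walks through the same example using PyTorch (chosen here purely for illustration; the text does not prescribe a framework). The vocabulary, the indices, and the embedding size are illustrative assumptions rather than values from a real dataset.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: token -> integer index (indices chosen for illustration).
vocab = {"<pad>": 0, "Review": 15, "this": 8, "product": 120}

# Tokenize the sentence and map each token to its integer index.
tokens = ["Review", "this", "product"]
indices = torch.tensor([[vocab[t] for t in tokens]])   # shape: (batch=1, seq_len=3)

# Embedding layer: maps each index to a dense 100-dimensional vector.
# num_embeddings must cover the largest index in the vocabulary.
embedding = nn.Embedding(num_embeddings=200, embedding_dim=100, padding_idx=0)

vectors = embedding(indices)   # shape: (1, 3, 100) -> one dense vector per token
print(vectors.shape)           # torch.Size([1, 3, 100])
```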
Common NLP Tasks and Corresponding Architectures
Sequence models can be adapted to different NLP tasks by varying how inputs are processed and outputs are generated. Here are some common patterns:
Many-to-One Architecture
In this configuration, the model reads an entire sequence of inputs and produces a single output after processing the final input step. The final hidden state (or sometimes an aggregation like max-pooling or average-pooling of all hidden states) is typically fed into a dense layer with an appropriate activation function (e.g., sigmoid for binary classification, softmax for multi-class classification) to produce the final prediction.
- Applications:
- Sentiment Analysis: Classifying a review, tweet, or sentence as positive, negative, or neutral based on the entire text.
- Text Classification: Assigning a document to one or more predefined categories (e.g., topic classification, spam detection).
Many-to-One architecture: The sequence model processes inputs step-by-step, and the final state contributes to a single classification output.
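As a rough illustration of this pattern, here is a minimal PyTorch sketch of a many-to-one sentiment classifier; the class name, vocabulary size, and layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Minimal many-to-one model: embed -> LSTM -> final hidden state -> dense layer."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])           # logits: (batch, num_classes)

model = SentimentClassifier(vocab_size=10_000)
logits = model(torch.randint(1, 10_000, (4, 20)))  # a batch of 4 dummy sequences
print(logits.shape)                                # torch.Size([4, 2])
```

Returning raw logits and pairing them with a loss such as nn.CrossEntropyLoss is the usual PyTorch idiom; the sigmoid or softmax mentioned above would be applied explicitly only when probabilities are needed at inference time.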
Many-to-Many (Synced) Architecture
Here, the model processes an input sequence and generates an output at each time step, so the output sequence typically has the same length as the input sequence. The hidden state at each time step t is used to predict the output y_t for that step, often via a shared dense layer applied independently at each step.
- Applications:
- Part-of-Speech (POS) Tagging: Assigning a grammatical tag (noun, verb, adjective, etc.) to each word in a sentence.
- Named Entity Recognition (NER): Identifying and classifying named entities (like person names, organizations, locations) in text for each token.
Many-to-Many (Synced) architecture: An output is produced for each input time step, often used for tagging tasks.
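A minimal sketch of the synced many-to-many pattern, again in PyTorch, might look like the following; the tag-set size and other dimensions are illustrative assumptions (17 roughly matches a universal POS tagset).

```python
import torch
import torch.nn as nn

class TokenTagger(nn.Module):
    """Minimal many-to-many (synced) model: one tag prediction per input token."""

    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.tag_head = nn.Linear(hidden_dim, num_tags)   # shared dense layer applied at every step

    def forward(self, token_ids):                          # (batch, seq_len)
        states, _ = self.gru(self.embedding(token_ids))    # (batch, seq_len, hidden_dim)
        return self.tag_head(states)                       # (batch, seq_len, num_tags)

tagger = TokenTagger(vocab_size=10_000, num_tags=17)
tag_logits = tagger(torch.randint(1, 10_000, (4, 20)))
print(tag_logits.shape)   # torch.Size([4, 20, 17]) -> one tag distribution per token
```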
Many-to-Many (Encoder-Decoder) Architecture
This architecture is designed for sequence-to-sequence (seq2seq) tasks where the input and output sequences can have different lengths. It consists of two main components (a code sketch follows the list below):
- Encoder: An RNN (or LSTM/GRU) processes the entire input sequence and compresses its information into a fixed-size context vector (often the final hidden state or a combination of hidden states).
- Decoder: Another RNN (or LSTM/GRU) takes the context vector from the encoder as its initial state and generates the output sequence step-by-step. At each step, it predicts the next token based on the context vector and the tokens it has generated so far.
- Applications:
- Machine Translation: Translating a sentence from one language to another.
- Text Summarization: Generating a concise summary of a longer document.
- Question Answering: Generating an answer based on a given context passage and question.
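The sketch below illustrates the basic encoder-decoder structure with a GRU-based encoder and decoder in PyTorch. It uses teacher forcing (feeding the known target tokens to the decoder), a common training-time simplification; at inference time the decoder would instead feed back its own predictions one step at a time. All names and sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's final hidden state seeds the decoder."""

    def __init__(self, src_vocab, tgt_vocab, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim, padding_idx=0)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.generator = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the source sequence into a context vector (its final hidden state).
        _, context = self.encoder(self.src_embed(src_ids))        # (1, batch, hidden_dim)
        # Decoder starts from the context and predicts the next token at each step
        # (teacher forcing: the known target tokens are fed as decoder inputs).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), context)
        return self.generator(dec_out)                             # (batch, tgt_len, tgt_vocab)

model = Seq2Seq(src_vocab=8_000, tgt_vocab=6_000)
logits = model(torch.randint(1, 8_000, (4, 12)), torch.randint(1, 6_000, (4, 9)))
print(logits.shape)   # torch.Size([4, 9, 6000]) -> input and output lengths differ
```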
While the basic encoder-decoder structure is powerful, more advanced versions often incorporate attention mechanisms, which allow the decoder to selectively focus on different parts of the input sequence while generating each output token. This significantly improves performance on longer sequences.
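As a rough illustration of the attention idea, the sketch below computes dot-product attention (one common scoring choice among several) between a decoder hidden state and all encoder hidden states, then uses the resulting weights to form a context vector; shapes and sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_outputs):
    """Illustrative dot-product attention.

    decoder_state:   (batch, hidden_dim)          current decoder hidden state
    encoder_outputs: (batch, src_len, hidden_dim) hidden states for every source token
    """
    # Score each source position against the current decoder state.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    # Softmax turns scores into weights: how much to focus on each source token.
    weights = F.softmax(scores, dim=1)
    # Weighted sum of encoder states gives the context vector for this decoding step.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hidden_dim)
    return context, weights

context, weights = dot_product_attention(torch.randn(4, 256), torch.randn(4, 12, 256))
print(context.shape, weights.shape)   # torch.Size([4, 256]) torch.Size([4, 12])
```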
Practical Considerations
- Choice of Unit: While simple RNNs demonstrate the core concept, LSTMs and GRUs are far more common in practice due to their ability to handle long-range dependencies and mitigate the vanishing gradient problem. GRUs are slightly simpler than LSTMs (fewer parameters) and often perform comparably, making them a good alternative.
- Padding: Sequence models typically process batches of data for efficiency, and the sequences in a batch usually have different lengths. Shorter sequences are therefore padded with a special padding token so that all sequences in the batch have the same length, and the model should be configured (often via masking) to ignore these padding tokens during computation; see the sketch after this list.
- Bidirectionality: For tasks like sentiment analysis or NER, information from future words can be just as important as information from past words. Bidirectional RNNs/LSTMs/GRUs process the input sequence in both forward and backward directions using two separate hidden layers. The outputs from both directions are typically concatenated at each time step, providing a richer representation of the context around each word.
- Stacking Layers: Just like feed-forward networks, RNN layers can be stacked (deep RNNs) to learn more complex representations of the sequence. The output sequence of one RNN layer becomes the input sequence for the next layer.
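The sketch below ties several of these considerations together using PyTorch utilities: sequences of different lengths are padded to a common length, packing lets the recurrent layer skip the padded positions, and the LSTM itself is both stacked and bidirectional. All sizes and indices are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Two sequences of different lengths, padded with index 0 so they fit in one batch.
seqs = [torch.tensor([15, 8, 120, 42]), torch.tensor([7, 99])]
lengths = torch.tensor([4, 2])
padded = pad_sequence(seqs, batch_first=True, padding_value=0)   # shape: (2, 4)

embedding = nn.Embedding(200, 100, padding_idx=0)   # padding_idx keeps the pad embedding at zero
# A stacked (2-layer), bidirectional LSTM; its outputs concatenate both directions.
lstm = nn.LSTM(100, 64, num_layers=2, bidirectional=True, batch_first=True)

# Packing tells the LSTM the true lengths so padded positions are skipped during computation.
packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
output, _ = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)   # torch.Size([2, 4, 128]) -> 128 = 2 directions x 64 hidden units
```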
Sequence models represent a fundamental shift from frequency-based text analysis to methods that acknowledge the ordered nature of language. By processing text step-by-step and maintaining an internal state, RNNs, LSTMs, and GRUs provide the foundation for tackling complex NLP tasks that require understanding context and dependencies within the text. The next section provides a hands-on exercise to solidify these concepts by building a simple sequence model.