While RNNs, LSTMs, and GRUs process sequences element by element, many real-world problems require mapping an input sequence of one length to an output sequence of a potentially different length. Consider machine translation (translating a sentence from French to English) or text summarization (condensing a long article into a few sentences). The input and output lengths are often unrelated. A standard RNN architecture, which typically produces one output for each input, isn't directly suited for these tasks.
To address this, the Sequence-to-Sequence (seq2seq) framework was developed, primarily using recurrent architectures like LSTMs or GRUs. The core idea is to use two separate RNNs: one to process the input sequence (the Encoder) and another to generate the output sequence (the Decoder).
The seq2seq model consists of two main components:
Encoder: This RNN reads the input sequence, token by token (e.g., words or subwords). Its goal is not to produce an output at each step but to compress the entire input sequence's information into a fixed-size vector representation. This vector is often called the "context vector" or "thought vector," typically represented by the final hidden state (and cell state, for LSTMs) of the encoder RNN.
Decoder: This RNN takes the context vector produced by the encoder as its initial hidden state. It then generates the output sequence, token by token. At each step $t$, the decoder receives the context vector, its own previous hidden state $h_{t-1}$, and the previously generated output token $y_{t-1}$ as input to produce the next output token $y_t$ and update its hidden state to $h_t$. The generation process usually starts with a special start-of-sequence <SOS> token and continues until an end-of-sequence <EOS> token is produced or a maximum length is reached.
High-level structure of a Sequence-to-Sequence model using RNNs. The Encoder processes the input to create a context vector, which initializes the Decoder to generate the output sequence.
The encoder processes the input sequence $X = (x_1, x_2, \ldots, x_n)$ and outputs a context vector $c$, which aims to summarize the entire input sequence:

$$c = \text{Encoder}(x_1, x_2, \ldots, x_n)$$

Typically, for an LSTM, $c$ would consist of the final hidden state $h_n$ and cell state $C_n$.

The decoder is initialized with this context (e.g., $h_0^{dec} = h_n^{enc}$, $C_0^{dec} = C_n^{enc}$). It then generates the output sequence $Y = (y_1, y_2, \ldots, y_m)$ one token at a time. The probability of the next token $y_t$ depends on the context $c$, the previous token $y_{t-1}$, and the decoder's current hidden state $h_t^{dec}$:

$$P(y_t \mid y_1, \ldots, y_{t-1}, c) = \text{Decoder}(y_{t-1}, h_{t-1}^{dec}, c)$$

The first input to the decoder is usually a special <SOS> token ($y_0 = \text{<SOS>}$).
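Chaining these per-step predictions together, the probability of the entire output sequence factorizes as a product of these conditionals:

$$P(y_1, \ldots, y_m \mid c) = \prod_{t=1}^{m} P(y_t \mid y_1, \ldots, y_{t-1}, c)$$

During training, the model is typically fit by maximizing this probability for the reference outputs, which is equivalent to minimizing a per-token negative log-likelihood (hence the NLLLoss mention in the decoder code below).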
Let's outline simplified Encoder and Decoder modules using PyTorch's nn.LSTM.
import torch
import torch.nn as nn
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Note: input_size is the source vocabulary size; the embeddings
        # have dimension hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(
            hidden_size, hidden_size, num_layers, batch_first=True
        )

    def forward(self, input_seq):
        # input_seq shape: (batch_size, seq_length)
        embedded = self.embedding(input_seq)
        # embedded shape: (batch_size, seq_length, hidden_size)
        # Hidden and cell states are not passed explicitly,
        # so they default to zeros
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs shape: (batch_size, seq_length, hidden_size)
        # hidden shape: (num_layers, batch_size, hidden_size)
        # cell shape: (num_layers, batch_size, hidden_size)
        # We typically use the final hidden and cell states as context
        return hidden, cell


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Note: output_size is the vocabulary size for the target
        # language
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.lstm = nn.LSTM(
            hidden_size, hidden_size, num_layers, batch_first=True
        )
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # Often used with NLLLoss

    def forward(self, input_token, hidden, cell):
        # input_token shape: (batch_size, 1) -> single token
        # hidden shape: (num_layers, batch_size, hidden_size)
        # cell shape: (num_layers, batch_size, hidden_size)
        embedded = self.embedding(input_token)
        # embedded shape: (batch_size, 1, hidden_size)
        # The context vector (encoder's final hidden/cell states)
        # is passed as the initial hidden/cell state here.
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # output shape: (batch_size, 1, hidden_size)
        # Reshape output to (batch_size, hidden_size) for the
        # linear layer
        output = output.squeeze(1)
        output = self.out(output)
        # output shape: (batch_size, output_size)
        # Optional: apply log-softmax to get log-probabilities
        # output = self.softmax(output)
        return output, hidden, cell
# Example Usage
# input_vocab_size = 10000
# output_vocab_size = 12000
# hidden_dim = 256
# n_layers = 2
# batch_size = 32
# input_length = 50
# encoder = EncoderRNN(input_vocab_size, hidden_dim, n_layers)
# decoder = DecoderRNN(hidden_dim, output_vocab_size, n_layers)
# Example input batch (indices)
# input_tensor = torch.randint(
# 0, input_vocab_size, (batch_size, input_length)
# )
# Pass through encoder
# encoder_hidden, encoder_cell = encoder(input_tensor)
# Decoder input starts with <SOS> token (assume index 0)
# decoder_input = torch.full((batch_size, 1), 0, dtype=torch.long)
# decoder_hidden = encoder_hidden # Use encoder's final hidden state
# decoder_cell = encoder_cell # Use encoder's final cell state
# Generate output sequence step-by-step (simplified loop)
# max_target_length = 60
# all_decoder_outputs = []
# for _ in range(max_target_length):
#     decoder_output, decoder_hidden, decoder_cell = decoder(
#         decoder_input, decoder_hidden, decoder_cell
#     )
#     all_decoder_outputs.append(decoder_output)
#
#     # Get the most likely next token (greedy decoding)
#     _, top_idx = decoder_output.topk(1)
#     # Use predicted token as next input
#     decoder_input = top_idx.detach()
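The loop above always runs for max_target_length steps. As described earlier, generation normally stops as soon as the decoder produces the end-of-sequence token. Below is a minimal sketch of such a decode loop for a single source sequence, using the EncoderRNN and DecoderRNN classes defined above; the <SOS> and <EOS> indices (0 and 1 here) are arbitrary assumptions that would come from your target vocabulary.

def greedy_decode(encoder, decoder, input_tensor, sos_idx=0, eos_idx=1,
                  max_length=60):
    # input_tensor shape: (1, input_length) -> a single source sequence
    with torch.no_grad():
        hidden, cell = encoder(input_tensor)
        decoder_input = torch.tensor([[sos_idx]], dtype=torch.long)
        generated = []
        for _ in range(max_length):
            output, hidden, cell = decoder(decoder_input, hidden, cell)
            # output shape: (1, output_size); pick the most likely token
            next_token = output.argmax(dim=1, keepdim=True)
            if next_token.item() == eos_idx:
                break  # stop once <EOS> is produced
            generated.append(next_token.item())
            decoder_input = next_token
    return generated  # predicted token indices, without <SOS>/<EOS>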
The standard Encoder-Decoder architecture using RNNs proved effective for many tasks. However, it relies on compressing the entire input sequence into a single, fixed-size context vector. This creates an information bottleneck, particularly problematic for long input sequences. It becomes difficult for the model to remember details from the beginning of a long input when generating the end of the output sequence.
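One way to make this concrete: with the EncoderRNN defined above, the context returned by the encoder has exactly the same shape for a 10-token input as for a 500-token input, so everything the decoder will ever know about the longer sequence must fit into the same fixed number of values. A quick check (the vocabulary size and dimensions here are arbitrary):

encoder = EncoderRNN(input_size=10000, hidden_size=256, num_layers=2)
short_input = torch.randint(0, 10000, (1, 10))   # 10-token sequence
long_input = torch.randint(0, 10000, (1, 500))   # 500-token sequence
h_short, _ = encoder(short_input)
h_long, _ = encoder(long_input)
print(h_short.shape)  # torch.Size([2, 1, 256])
print(h_long.shape)   # torch.Size([2, 1, 256]) -- same fixed size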
This limitation was a significant motivation for the development of attention mechanisms. Attention allows the decoder to selectively focus on different parts of the input sequence at each step of the output generation process, rather than relying solely on the single context vector. This ability to look back at the relevant parts of the source input dramatically improved performance on tasks like machine translation and paved the way for the Transformer architecture, which we will explore in the next chapter.