While LSTMs and GRUs significantly improve the ability of recurrent networks to handle long-range dependencies, the basic RNN structures we've seen so far (many-to-one for classification, many-to-many for prediction where input and output lengths often align) have limitations. Consider tasks like machine translation: the input sentence ("Hello world") and the output sentence ("Bonjour le monde") can have different lengths, and the relationship between words isn't always a simple one-to-one mapping across time steps. Similarly, summarizing a long document into a short paragraph requires processing a long input sequence to generate a much shorter output sequence.
For these types of problems, where the input and output sequences might differ in length and structure, the standard RNN approach isn't sufficient. We need a more flexible architecture. This is where the Encoder-Decoder architecture, often called the Sequence-to-Sequence (Seq2Seq) model, comes into play.
The core idea is elegant: divide the task into two distinct phases handled by two separate recurrent networks (or stacks of networks):

- Encoder: reads the input sequence one element at a time and compresses everything it has seen into a single fixed-size context vector, typically its final hidden state.
- Decoder: is initialized with this context vector and generates the output sequence one element at a time, feeding each predicted element back in as its next input.
This separation allows the model to handle input and output sequences of different lengths. The encoder maps the variable-length input to a fixed-size context, and the decoder maps that fixed-size context back to a variable-length output.
We can visualize this flow as follows:
A high-level view of the Encoder-Decoder architecture. The Encoder processes the input sequence to produce a context vector, which then initializes the Decoder to generate the output sequence.
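To make the two halves concrete, here is a minimal sketch of an encoder and decoder in PyTorch. The class names, hyperparameters, and the choice of a GRU are illustrative assumptions, not a prescribed implementation; the point is simply that the encoder's final hidden state serves as the fixed-size context, and the decoder advances one step per output token.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads a variable-length input sequence and returns a fixed-size context vector."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)        # (batch, src_len, embed_dim)
        outputs, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return hidden                                # final hidden state = context vector c


class Decoder(nn.Module):
    """Generates the output sequence one token at a time, conditioned on the context."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_token, hidden):
        # input_token: (batch, 1), the previous output token (or <sos> at the first step)
        embedded = self.embedding(input_token)       # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, hidden)  # one recurrent step
        logits = self.out(output.squeeze(1))         # (batch, vocab_size)
        return logits, hidden
```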
Let's think about the process intuitively:

1. Encoding: The encoder RNN reads the input sequence $x_1, \dots, x_T$ one element at a time, updating its hidden state at each step, $h_t^{enc} = \text{RNN}_{enc}(x_t, h_{t-1}^{enc})$. After the last element, its final hidden state is taken as the context vector $c$, a compressed summary of the entire input.
2. Decoding: The decoder RNN is initialized with $c$ and receives a special start-of-sequence token <sos> as its first input. Given its current input (first <sos>, later the previous output word), it predicts the first output word $y_1$ and updates its state to $h_1^{dec}$. Each subsequent step works the same way:

$$(y_{t'}, h_{t'}^{dec}) = \text{RNN}_{dec}(\text{input}_{t'}, h_{t'-1}^{dec})$$

(where $\text{input}_{t'}$ is <sos> for $t'=1$, and $y_{t'-1}$ for $t'>1$)

3. Stopping: Generation continues until the decoder produces a special end-of-sequence token <eos> or reaches a predefined maximum length.

The Encoder-Decoder architecture forms the foundation for many sophisticated sequence modeling tasks, such as machine translation and text summarization.
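The generation loop described above can be sketched as a simple greedy decoding function. It reuses the hypothetical Encoder and Decoder classes from earlier; the helper name and the sos_id/eos_id arguments are illustrative assumptions. Note that during training, one would typically use teacher forcing (feeding the ground-truth previous word) rather than the model's own predictions.

```python
def greedy_decode(encoder, decoder, src_tokens, sos_id, eos_id, max_len=50):
    """Greedily generate an output sequence for a batch of source sequences."""
    context = encoder(src_tokens)                    # fixed-size context vector c
    hidden = context                                 # decoder state initialized from c
    input_token = torch.full((src_tokens.size(0), 1), sos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        logits, hidden = decoder(input_token, hidden)     # one decoding step
        next_token = logits.argmax(dim=-1, keepdim=True)  # greedy choice of y_t'
        generated.append(next_token)
        if (next_token == eos_id).all():             # stop once <eos> is produced
            break
        input_token = next_token                     # feed y_{t'-1} back in as input
    return torch.cat(generated, dim=1)               # (batch, generated_len)
```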
While powerful, the basic Encoder-Decoder architecture has a potential bottleneck: the single, fixed-size context vector $c$. For very long input sequences, compressing all necessary information into this single vector can be challenging. The decoder only gets this one summary to work with when generating the entire output. Intuitively, it might be helpful for the decoder to selectively focus on different parts of the input sequence as it generates different parts of the output.
This limitation motivates the development of Attention Mechanisms, which allow the decoder to dynamically look back at relevant parts of the encoder's hidden states (not just the final one) at each step of the generation process. We will briefly touch upon attention in the next section as it significantly enhances the performance of sequence-to-sequence models, especially for longer sequences.