As we saw in the previous chapter, recurrent neural networks (RNNs), including LSTMs and GRUs, process sequences step by step. Each hidden state $h_t$ is computed from the input $x_t$ and the previous hidden state $h_{t-1}$. This sequential nature, while intuitive for modeling sequences, introduces significant challenges, particularly at the scale required for large language models.
The core limitations stem directly from this sequential processing:
Limited Parallelization: The computation of $h_t$ must wait for $h_{t-1}$ to be computed. This dependency prevents parallelizing the computation across the time dimension (sequence length) within a single training example. While you can parallelize across different sequences in a batch, the processing within each sequence remains inherently sequential. For very long sequences, this becomes a major computational bottleneck, limiting training speed. Imagine processing a document with thousands of words; the calculation must proceed one word at a time.
Difficulty with Long-Range Dependencies: Information from early parts of a sequence must travel through the entire chain of recurrent connections to influence the processing of later parts. While LSTMs and GRUs were designed with gating mechanisms to mitigate the vanishing gradient problem, maintaining precise information across very long distances remains challenging. The path length for information flow between two distant tokens $x_i$ and $x_j$ is proportional to $|i - j|$. This means gradients can still diminish or explode over long paths, making it difficult for the model to learn relationships between words far apart in the sequence.
Consider a simplified RNN update:
import torch

# Placeholder input sequence (batch_size=1, seq_len=5, features=10)
input_seq = torch.randn(1, 5, 10)

# Initial hidden state (batch_size=1, hidden_size=20)
h_prev = torch.zeros(1, 20)

# Fixed weight matrices for a simplified RNN cell (not an actual nn.RNNCell);
# defining them once ensures the same weights are reused at every time step.
W_xh = torch.randn(10, 20)  # input-to-hidden weights
W_hh = torch.randn(20, 20)  # hidden-to-hidden weights

def rnn_cell(input_t, h_prev):
    # Simplified update: h_t = tanh(x_t W_xh + h_{t-1} W_hh)
    return torch.tanh(input_t @ W_xh + h_prev @ W_hh)

hidden_states = []

# Sequential processing loop
for t in range(input_seq.shape[1]):  # Loop over sequence length
    input_t = input_seq[:, t, :]
    h_t = rnn_cell(input_t, h_prev)
    hidden_states.append(h_t)
    h_prev = h_t  # Update hidden state for the next step

# hidden_states now contains the state for each time step.
# Note: the computation for time 't' depends explicitly on the result from 't-1'.
This loop highlights the sequential dependency: we cannot compute $h_t$ for $t = 3$ before computing it for $t = 2$.
The Transformer architecture fundamentally breaks this sequential chain by removing recurrence altogether. Instead of passing information step-by-step, it uses attention mechanisms. The central idea behind attention is to allow the model, when processing one element in the sequence (e.g., a word), to directly look at and draw information from all other elements in the sequence.
Imagine you are translating the sentence "The cat sat on the mat". When processing the word "sat", an attention mechanism allows the model to directly assess the relevance of "The", "cat", "on", "the", and "mat" to understand the context of "sat". It computes a set of attention scores, representing the importance of each other word to the current word. These scores are then used to create a weighted sum of the other words' representations, providing a contextually informed representation for "sat".
Crucially, this "looking" process doesn't depend on the distance between words in the sequence. The model can establish a direct connection between "The" and "mat" just as easily as between "on" and "the". The path length between any two tokens in the sequence becomes constant, effectively O(1), as the attention mechanism calculates pairwise interactions directly. This dramatically simplifies the learning of long-range dependencies compared to the O(n) path length in RNNs for a sequence of length n.
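As a rough sketch of this idea, the snippet below computes attention weights for a toy version of the sentence above. The random token vectors and the use of those vectors directly as queries, keys, and values are simplifying assumptions; real Transformer layers apply learned projections, which the following sections cover.

import torch
import torch.nn.functional as F

# Toy token vectors for "The cat sat on the mat"
# (seq_len=6, d_model=8); random placeholders, not learned embeddings.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
x = torch.randn(6, 8)

# Simplified attention: the token vectors themselves serve as queries,
# keys, and values (real layers apply learned projections first).
scores = x @ x.T / (8 ** 0.5)        # pairwise relevance, shape (6, 6)
weights = F.softmax(scores, dim=-1)  # each row sums to 1

# Contextual representation of "sat" (index 2): a weighted sum over
# every token in the sentence, regardless of its distance from "sat".
context_sat = weights[2] @ x
print(weights[2])  # attention of "sat" over all six tokens

Note that the weight connecting "sat" to "mat" comes from the same single dot product as the weight connecting "sat" to "cat"; distance plays no role in the computation.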
By removing the sequential dependency (ht​ depending on ht−1​), the Transformer enables massive parallelization across the sequence length. The computations required to generate the representation for each token can, in theory, be performed simultaneously. Although there are dependencies within the calculation for a single token (e.g., calculating attention scores before applying them), the overall computation for the entire sequence's representations within a layer can be parallelized much more effectively than in an RNN. This property is fundamental to training the extremely large models that have become prevalent.
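The same toy setup also illustrates the parallelization point. Unlike the earlier RNN loop, the sketch below (again using placeholder vectors and no learned projections) produces contextual representations for every position with two matrix multiplications and no iteration over time steps.

import torch
import torch.nn.functional as F

# Placeholder token vectors for a length-6 sequence (d_model=8).
x = torch.randn(6, 8)

# All positions at once: no loop over the sequence.
weights = F.softmax(x @ x.T / (8 ** 0.5), dim=-1)  # (6, 6) attention weights
contexts = weights @ x                             # (6, 8), one row per token

# Every row of 'contexts' is produced by the same two matrix products,
# so the work parallelizes across the sequence length instead of
# proceeding one step at a time as in the RNN loop above.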
The shift from recurrence to attention can be visualized as moving from a chain structure to a fully connected graph (within a layer), where every token can directly interact with every other token.
Comparison of information flow. In RNNs (left), information flows sequentially. In attention-based models like the Transformer (right), each output can directly attend to all inputs simultaneously.
This fundamental shift from sequential processing to parallelizable attention mechanisms is the primary reason the Transformer architecture has been so successful for large-scale sequence modeling. The following sections will examine the specific mechanisms, like Scaled Dot-Product Attention and Multi-Head Attention, that implement this concept effectively.