As we saw in the previous chapter, recurrent neural networks (RNNs), including LSTMs and GRUs, process sequences step by step. Each hidden state $h_t$ is computed from the input $x_t$ and the previous hidden state $h_{t-1}$. This sequential nature, while intuitive for modeling sequences, introduces significant challenges, particularly at the scale required for large language models.
The core limitations stem directly from this sequential processing:
Limited Parallelization: The computation of $h_t$ must wait for $h_{t-1}$ to be computed. This dependency prevents parallelizing the computation across the time dimension (sequence length) within a single training example. While you can parallelize across different sequences in a batch, the processing within each sequence remains inherently sequential. For very long sequences, this becomes a major computational bottleneck, limiting training speed. Imagine processing a document with thousands of words; the calculation must proceed one word at a time.
Difficulty with Long-Range Dependencies: Information from early parts of a sequence must travel through the entire chain of recurrent connections to influence the processing of later parts. While LSTMs and GRUs were designed with gating mechanisms to mitigate the vanishing gradient problem, maintaining precise information across very long distances remains challenging. The path length for information flow between two distant tokens $x_i$ and $x_j$ is proportional to $|i - j|$. This means gradients can still diminish or explode over long paths, making it difficult for the model to learn relationships between words far apart in the sequence.
Consider a simplified RNN update:
import torch

# Placeholder input sequence (batch_size=1, seq_len=5, features=10)
input_seq = torch.randn(1, 5, 10)

# Initial hidden state (batch_size=1, hidden_size=20)
h_prev = torch.zeros(1, 20)

# Fixed weight matrices for a simplified RNN cell (not an actual nn.RNNCell);
# defining them once ensures the same weights are reused at every time step.
W_xh = torch.randn(10, 20)  # input-to-hidden weights
W_hh = torch.randn(20, 20)  # hidden-to-hidden weights

def rnn_cell(input_t, h_prev):
    # Simplified update: h_t = tanh(x_t W_xh + h_{t-1} W_hh)
    return torch.tanh(input_t @ W_xh + h_prev @ W_hh)

hidden_states = []

# Sequential processing loop
for t in range(input_seq.shape[1]):  # Loop over sequence length
    input_t = input_seq[:, t, :]
    h_t = rnn_cell(input_t, h_prev)
    hidden_states.append(h_t)
    h_prev = h_t  # Update hidden state for the next step

# hidden_states now contains the state for each time step.
# Note: the computation for time 't' depends explicitly on the result from 't-1'.
This loop highlights the sequential dependency: we cannot compute $h_t$ for $t = 3$ before computing it for $t = 2$.
The Transformer architecture fundamentally breaks this sequential chain by removing recurrence altogether. Instead of passing information step-by-step, it uses attention mechanisms. The central idea behind attention is to allow the model, when processing one element in the sequence (e.g., a word), to directly look at and draw information from all other elements in the sequence.
Imagine you are translating the sentence "The cat sat on the mat". When processing the word "sat", an attention mechanism allows the model to directly assess the relevance of "The", "cat", "on", "the", and "mat" to understand the context of "sat". It computes a set of attention scores, representing the importance of each other word to the current word. These scores are then used to create a weighted sum of the other words' representations, providing a contextually informed representation for "sat".
Crucially, this "looking" process doesn't depend on the distance between words in the sequence. The model can establish a direct connection between "The" and "mat" just as easily as between "on" and "the". The path length between any two tokens in the sequence becomes constant, effectively O(1), as the attention mechanism calculates pairwise interactions directly. This dramatically simplifies the learning of long-range dependencies compared to the O(n) path length in RNNs for a sequence of length n.
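As a rough sketch of this idea, the snippet below computes attention weights for a toy version of the sentence above. The random token vectors and the use of those vectors directly as queries, keys, and values are simplifying assumptions; real Transformer layers apply learned projections, which the following sections cover.

import torch
import torch.nn.functional as F

# Toy token vectors for "The cat sat on the mat"
# (seq_len=6, d_model=8); random placeholders, not learned embeddings.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
x = torch.randn(6, 8)

# Simplified attention: the token vectors themselves serve as queries,
# keys, and values (real layers apply learned projections first).
scores = x @ x.T / (8 ** 0.5)        # pairwise relevance, shape (6, 6)
weights = F.softmax(scores, dim=-1)  # each row sums to 1

# Contextual representation of "sat" (index 2): a weighted sum over
# every token in the sentence, regardless of its distance from "sat".
context_sat = weights[2] @ x
print(weights[2])  # attention of "sat" over all six tokens

Note that the weight connecting "sat" to "mat" comes from the same single dot product as the weight connecting "sat" to "cat"; distance plays no role in the computation.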
By removing the sequential dependency (ht​ depending on ht−1​), the Transformer enables massive parallelization across the sequence length. The computations required to generate the representation for each token can, in theory, be performed simultaneously. Although there are dependencies within the calculation for a single token (e.g., calculating attention scores before applying them), the overall computation for the entire sequence's representations within a layer can be parallelized much more effectively than in an RNN. This property is fundamental to training the extremely large models that have become prevalent.
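The same toy setup also illustrates the parallelization point. Unlike the earlier RNN loop, the sketch below (again using placeholder vectors and no learned projections) produces contextual representations for every position with two matrix multiplications and no iteration over time steps.

import torch
import torch.nn.functional as F

# Placeholder token vectors for a length-6 sequence (d_model=8).
x = torch.randn(6, 8)

# All positions at once: no loop over the sequence.
weights = F.softmax(x @ x.T / (8 ** 0.5), dim=-1)  # (6, 6) attention weights
contexts = weights @ x                             # (6, 8), one row per token

# Every row of 'contexts' is produced by the same two matrix products,
# so the work parallelizes across the sequence length instead of
# proceeding one step at a time as in the RNN loop above.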
The shift from recurrence to attention can be visualized as moving from a chain structure to a fully connected graph (within a layer), where every token can directly interact with every other token.
Comparison of information flow. In RNNs (left), information flows sequentially. In attention-based models like the Transformer (right), each output can directly attend to all inputs simultaneously.
This fundamental shift from sequential processing to parallelizable attention mechanisms is the primary reason the Transformer architecture has been so successful for large-scale sequence modeling. The following sections will examine the specific mechanisms, like Scaled Dot-Product Attention and Multi-Head Attention, that implement this concept effectively.