While the standard Transformer architecture processes sequences in fixed-length, independent chunks or windows, this approach encounters a significant limitation known as context fragmentation. When dealing with long documents or sequential data streams that exceed the fixed window size, the model must process the input in separate segments. Information from preceding segments is typically lost when processing a new segment, hindering the model's ability to capture long-range dependencies that span across these segment boundaries.
Transformer-XL (meaning Transformer with eXtra Long context) directly addresses this limitation by introducing a recurrence mechanism at the segment level. Instead of processing each segment in isolation, Transformer-XL reuses the hidden states computed from previous segments.
The core idea is straightforward yet effective. When the model processes a segment, say segment τ, it computes a sequence of hidden states at each layer, similar to the standard Transformer. These hidden states are then cached or stored in memory. When the model moves to the next segment, τ+1, the layers can attend not only to the hidden states within the current segment τ+1 but also to the cached hidden states from the previous segment τ.
Let $h_\tau^n \in \mathbb{R}^{L \times d}$ denote the sequence of hidden states produced by the $n$-th Transformer layer for the $\tau$-th segment, where $L$ is the segment length and $d$ is the hidden dimension. When computing the hidden states for the next segment, $h_{\tau+1}^{n}$, the $n$-th layer receives inputs derived from $h_{\tau+1}^{n-1}$ (the output of the layer below for the current segment) and $h_{\tau}^{n-1}$ (the output of the layer below for the previous segment).
Specifically, the extended context for layer $n$ at segment $\tau+1$ is formed by concatenating the cached states from the previous segment with the states from the current segment along the sequence length dimension:

$$\tilde{h}_{\tau+1}^{n-1} = \left[\, \mathrm{SG}\!\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1} \,\right]$$

Here, $\mathrm{SG}(\cdot)$ denotes a stop-gradient operation, meaning that gradients are not backpropagated through the cached states $h_{\tau}^{n-1}$. This is important: it prevents the computational graph from growing excessively long and avoids the associated optimization difficulties. The $\circ$ operator signifies concatenation along the sequence length dimension.
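The caching step amounts to a single concatenation with a stop-gradient. Below is a minimal PyTorch sketch; the tensor names, shapes, and random inputs are illustrative rather than a reference implementation:

```python
import torch

# Build the extended context for one layer at segment tau+1.
# Shapes: (segment_length, batch, hidden_dim); all names here are illustrative.
L, B, d = 4, 2, 8

h_prev = torch.randn(L, B, d)                      # cached h_{tau}^{n-1} from the previous segment
h_curr = torch.randn(L, B, d, requires_grad=True)  # h_{tau+1}^{n-1} for the current segment

# SG(.) corresponds to detach(): gradients never flow into the cached states,
# so backpropagation stops at the segment boundary.
h_ext = torch.cat([h_prev.detach(), h_curr], dim=0)  # concatenate along the sequence length

print(h_ext.shape)  # torch.Size([8, 2, 8]) -> 2L positions of context
```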
The attention mechanism within layer $n$ then computes its Queries ($Q$) based solely on the current segment's representations $h_{\tau+1}^{n-1}$, while the Keys ($K$) and Values ($V$) are derived from the extended context $\tilde{h}_{\tau+1}^{n-1}$:
$$Q_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_q^{n}, \qquad K_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_k^{n}, \qquad V_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_v^{n}$$

$$\mathrm{AttentionOutput}_{\tau+1}^{n} = \mathrm{Attention}\!\left(Q_{\tau+1}^{n}, K_{\tau+1}^{n}, V_{\tau+1}^{n}\right)$$

This allows each position in the current segment $\tau+1$ to attend to positions within itself and also to all positions in the preceding segment $\tau$, effectively doubling the context length available at each step without propagating gradients across segment boundaries.
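To make the query/key/value asymmetry concrete, here is a hedged PyTorch sketch of a single attention layer under segment-level recurrence. The projection modules and tensor names are assumptions for illustration; a real implementation would also apply a causal mask and split the computation into multiple heads:

```python
import torch
import torch.nn as nn

L, B, d = 4, 2, 8  # segment length, batch size, hidden dimension (illustrative)

W_q = nn.Linear(d, d, bias=False)  # query projection
W_k = nn.Linear(d, d, bias=False)  # key projection
W_v = nn.Linear(d, d, bias=False)  # value projection

h_curr = torch.randn(L, B, d)      # h_{tau+1}^{n-1}: current segment only
h_ext = torch.randn(2 * L, B, d)   # extended context [SG(h_{tau}^{n-1}) ; h_{tau+1}^{n-1}]

q = W_q(h_curr)   # (L, B, d)  queries come only from the current segment
k = W_k(h_ext)    # (2L, B, d) keys span the previous and current segments
v = W_v(h_ext)    # (2L, B, d) values span the previous and current segments

# Scaled dot-product attention: each of the L current positions attends to all 2L
# context positions. (A causal mask would be added here for language modeling.)
scores = torch.einsum('ibd,jbd->bij', q, k) / d ** 0.5  # (B, L, 2L)
attn = torch.softmax(scores, dim=-1)
out = torch.einsum('bij,jbd->ibd', attn, v)             # (L, B, d)

print(out.shape)
```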
Flow of information in Transformer-XL. Hidden states from segment τ are cached and used as extended context for processing segment τ+1, without backpropagating gradients through the cache.
The state reuse mechanism introduces a challenge for standard positional encodings (like sinusoidal or learned absolute embeddings described in Chapter 4). If we simply add the same absolute positional encoding to each segment, a position index (e.g., the 10th token) would have the same encoding regardless of whether it's the 10th token of the first segment or the 10th token of the second segment. This positional ambiguity makes it difficult for the model to distinguish the temporal order across segments.
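A quick numerical check makes the ambiguity visible. The snippet below uses the standard sinusoidal encoding (the helper function and segment setup are illustrative): applied independently per segment, the encoding for position 9 is identical in every segment, so the model cannot tell the 10th token of segment 1 apart from the 10th token of segment 2 by position alone.

```python
import torch

def sinusoidal_encoding(positions, d=16):
    # Standard sinusoidal positional encoding (illustrative helper).
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = positions.float().unsqueeze(-1) * inv_freq
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

L = 32
enc_seg_1 = sinusoidal_encoding(torch.arange(L))  # positions 0..L-1 within segment 1
enc_seg_2 = sinusoidal_encoding(torch.arange(L))  # positions 0..L-1 within segment 2

# Identical vectors: absolute position 9 looks the same in both segments.
print(torch.equal(enc_seg_1[9], enc_seg_2[9]))  # True
```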
Transformer-XL solves this by employing a relative positional encoding scheme. Instead of encoding the absolute position $i$ of a token, it encodes the relative distance (or offset) $i-j$ between the query token at position $i$ and the key token at position $j$. This relative information is injected directly into the attention score calculation.
In the standard self-attention score calculation for query $q_i$ and key $k_j$, we compute $q_i^{\top} k_j$. In Transformer-XL with relative positional encoding, this calculation is modified to incorporate terms that depend only on the relative distance $i-j$. The exact formulation involves replacing the absolute positional information within the key vectors with relative positional embeddings. This ensures that the attention mechanism is aware of the distance between tokens, irrespective of their absolute positions within potentially very long sequences processed segment by segment.
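For reference, the decomposition used in the Transformer-XL paper (Dai et al., 2019) splits the attention score between query position $i$ and key position $j$ into four terms:

$$A_{i,j}^{\mathrm{rel}} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content-content}} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-position}} + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}} + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global position bias}}$$

Here, $E_{x_i}$ and $E_{x_j}$ are the content representations of the two tokens, $R_{i-j}$ is a sinusoidal embedding of the relative offset $i-j$, $W_{k,E}$ and $W_{k,R}$ are separate key projections for content and position, and $u, v$ are learned vectors that replace the query's absolute-position-dependent terms.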
The introduction of segment-level recurrence and relative positional encoding offers several benefits:

- Longer effective context: information flows across segment boundaries, so dependencies longer than a single segment can be captured, and the effective context grows with the number of layers.
- No context fragmentation: each segment is processed with knowledge of what came before it, rather than in isolation.
- Faster evaluation: cached hidden states from previous segments can be reused at inference time instead of being recomputed for every new position.
Transformer-XL represents a significant step in enabling Transformers to handle much longer sequences effectively, paving the way for applications involving lengthy documents, articles, or continuous data streams where maintaining long-range coherence is important. While it introduces the overhead of caching states, the benefits in terms of modeling capability and evaluation speed often outweigh this cost for specific tasks.