As highlighted in the chapter introduction, the standard self-attention mechanism's primary computational bottleneck is its quadratic complexity with respect to the input sequence length $N$. Calculating the attention score matrix via $A = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$ involves computing the $N \times N$ matrix $QK^T$. This requires $N^2$ dot products between query and key vectors (each of dimension $d_k$), followed by scaling, softmax normalization, and finally multiplication by the value matrix $V$. The overall cost is typically dominated by the $QK^T$ calculation, resulting in $O(N^2 d_k)$ time and $O(N^2)$ space complexity for the attention matrix itself. This scaling behavior makes it computationally prohibitive to apply standard Transformers to very long sequences (e.g., high-resolution images treated as sequences of patches, long documents, or genomic data).
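To make the bottleneck concrete, here is a minimal NumPy sketch of the computation just described. The function name and shapes are illustrative, not a reference implementation; the point is that the $(N, N)$ score matrix must be materialized before the softmax.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q: (N, d_k), K: (N, d_k), V: (N, d_v).
    Materializes the full (N, N) attention matrix, hence
    O(N^2 * d_k) time and O(N^2) memory.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (N, N) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d_v)

N, d_k, d_v = 1024, 64, 64
Q, K, V = (np.random.randn(N, dim) for dim in (d_k, d_k, d_v))
out = standard_attention(Q, K, V)                     # allocates a 1024 x 1024 score matrix
```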
To overcome this limitation, a significant line of research has focused on developing Linear Transformers, variants that approximate the self-attention mechanism with linear, $O(N)$, time and memory complexity. The core idea is to avoid the explicit computation and storage of the full $N \times N$ attention matrix.
If we consider the attention computation without the softmax normalization, $O_{\text{unnorm}} = (QK^T)V$, the associativity of matrix multiplication allows us to reorder the computation as $O_{\text{unnorm}} = Q(K^T V)$. Here, $K^T$ is $d_k \times N$ and $V$ is $N \times d_v$. Computing $K^T V$ takes $O(N d_k d_v)$ operations. Multiplying the result by $Q$ ($N \times d_k$) then takes another $O(N d_k d_v)$ operations. If the embedding dimension $d = d_k = d_v$ is considered fixed or much smaller than $N$, the overall complexity appears linear in $N$.
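The sketch below checks this reordering numerically under the stated assumptions (no softmax, random matrices): both evaluation orders give the same result, but only one of them ever forms an $N \times N$ matrix.

```python
import numpy as np

N, d_k, d_v = 2048, 64, 64
Q = np.random.randn(N, d_k)
K = np.random.randn(N, d_k)
V = np.random.randn(N, d_v)

# Quadratic order: build the (N, N) matrix first.
out_quadratic = (Q @ K.T) @ V                  # O(N^2 * d) time, O(N^2) memory

# Linear order: contract keys and values first.
kv = K.T @ V                                   # (d_k, d_v), O(N * d_k * d_v)
out_linear = Q @ kv                            # (N, d_v),   O(N * d_k * d_v)

assert np.allclose(out_quadratic, out_linear)  # identical up to floating-point error
```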
However, the row-wise softmax function prevents this simple reordering. For a row $z$ of $\frac{QK^T}{\sqrt{d_k}}$, the softmax is defined as $\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$, where $z_i$ denotes the $i$-th element of that row. The normalization term $\sum_j \exp(z_j)$ couples all elements within a row, seemingly requiring the computation of all $N$ dot products for a given query before normalization and multiplication by the values.
Linear Transformer variants employ different strategies to approximate the attention mechanism and bypass the softmax bottleneck, effectively enabling a rearrangement similar to $Q(K^T V)$:
Kernel Approximation: This approach approximates the dot-product similarity followed by the exponential function inherent in softmax, viewing $\exp(q_i^T k_j / \sqrt{d_k})$ as a kernel function $\mathcal{K}(q_i, k_j)$. If this kernel can be approximated by an inner product of feature maps $\phi(\cdot)$, such that $\mathcal{K}(q_i, k_j) \approx \phi(q_i)^T \phi(k_j)$, the attention sum $\sum_j \mathcal{K}(q_i, k_j)\, v_j$ can be rewritten as $\sum_j \left(\phi(q_i)^T \phi(k_j)\right) v_j = \phi(q_i)^T \left(\sum_j \phi(k_j)\, v_j^T\right)$ (vector and matrix dimensions need careful handling here; this shows the conceptual reordering). The term $\sum_j \phi(k_j)\, v_j^T$ computes an aggregate representation based on keys and values. This aggregate can be computed once (linearly in $N$) and then multiplied by the feature map of each query $\phi(q_i)$ (again, linearly in $N$). The softmax denominator can be handled the same way, since $\sum_j \phi(q_i)^T \phi(k_j) = \phi(q_i)^T \sum_j \phi(k_j)$ is also computable in a single linear pass over the keys. The challenge lies in finding suitable feature maps $\phi$ that accurately approximate the exponential kernel while allowing this decomposition. Models like the Performer use techniques based on random features (specifically, Fast Attention via Positive Orthogonal Random Features, or FAVOR+) to construct these mappings with theoretical guarantees on the approximation quality. These methods achieve $O(N)$ complexity. We will examine kernel-based methods more closely in the section "Kernel-Based Attention Approximation (Performers)".
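The sketch below illustrates the kernel-based reordering using plain Gaussian positive random features for the softmax kernel; it is a simplified stand-in for FAVOR+ (which additionally orthogonalizes the projection rows), and the function names and `num_features` parameter are our own.

```python
import numpy as np

def feature_map(X, W):
    """Positive random features for the softmax kernel:
    exp(x^T y) ≈ phi(x)^T phi(y), with phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m).
    """
    m = W.shape[0]
    proj = X @ W.T                                        # (N, m)
    return np.exp(proj - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(Q, K, V, num_features=256, seed=0):
    d_k = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d_k))          # random projection matrix
    # Fold the 1/sqrt(d_k) temperature into the inputs before the feature map.
    Qf = feature_map(Q / d_k**0.25, W)                    # (N, m)
    Kf = feature_map(K / d_k**0.25, W)                    # (N, m)
    kv = Kf.T @ V                                         # (m, d_v): aggregate of keys and values
    normalizer = Qf @ Kf.sum(axis=0, keepdims=True).T     # (N, 1): approximated softmax denominator
    return (Qf @ kv) / normalizer                         # (N, d_v), never forms an N x N matrix

N, d_k, d_v = 4096, 64, 64
Q, K, V = (np.random.randn(N, dim) for dim in (d_k, d_k, d_v))
out = linear_attention(Q, K, V)
```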
Low-Rank Projection: This strategy is based on the observation or assumption that the attention matrix $A = \mathrm{softmax}(QK^T/\sqrt{d_k})$ is often low-rank or can be well approximated by a low-rank matrix. The Linformer model exemplifies this by introducing learnable projection matrices $E$ and $F$, each of shape $k \times N$, where $k$ is a fixed dimension significantly smaller than $N$ ($k \ll N$). It computes projected key and value matrices $K' = EK$ and $V' = FV$, with shapes $k \times d_k$ and $k \times d_v$, respectively. The attention mechanism then operates on these projected matrices, for instance by computing attention weights based on $Q(K')^T$ (an $N \times k$ matrix) and applying them to $V'$ ($k \times d_v$). The complexity reduces from $O(N^2)$ to $O(Nk)$, achieving linear scaling if $k$ is constant or grows sub-linearly with $N$. This approach will be detailed in the section "Low-Rank Projection Methods (Linformer)".
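A minimal sketch of this projection scheme follows. In Linformer, $E$ and $F$ are learned parameters (often shared across heads and layers); here they are random matrices used purely as stand-ins, and the function name is illustrative.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention with projected keys and values.

    Q: (N, d_k), K: (N, d_k), V: (N, d_v); E, F: (k, N) projection matrices.
    The attention matrix is only (N, k) instead of (N, N).
    """
    d_k = Q.shape[-1]
    K_proj = E @ K                                    # (k, d_k)
    V_proj = F @ V                                    # (k, d_v)
    scores = Q @ K_proj.T / np.sqrt(d_k)              # (N, k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax over k columns
    return weights @ V_proj                           # (N, d_v)

N, k, d = 4096, 256, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
E, F = (np.random.randn(k, N) / np.sqrt(N) for _ in range(2))  # stand-ins for learned projections
out = linformer_attention(Q, K, V, E, F)              # O(N * k * d) rather than O(N^2 * d)
```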
Linear attention mechanisms offer a compelling advantage in terms of computational efficiency, particularly for long sequences.
| Feature | Standard Attention | Linear Attention (Approx.) |
|---|---|---|
| Time complexity | $O(N^2 d)$ | $O(Ndr)$, where $r$ depends on the method (e.g., feature dimension or projection size $k$) |
| Memory (attention matrix) | $O(N^2)$ | $O(Nr)$ or less |
| Computation | Explicit $N \times N$ matrix | Avoids explicit $N \times N$ matrix |
| Exactness | Exact | Approximate |
| Suitability | Short to moderate sequences | Long sequences |
The primary trade-off is that these methods compute an approximation of the full self-attention mechanism. While empirical results often show that these approximations perform remarkably well, sometimes even outperforming standard attention on specific tasks (potentially due to implicit regularization effects), they might not capture exactly the same relational information as the original formulation. The quality of the approximation and its impact on downstream task performance must be evaluated empirically.
By reducing the computational burden from quadratic to linear, these methods significantly expand the applicability of Transformer architectures to domains previously hindered by sequence length constraints, enabling models to process entire documents, high-resolution images, or long time-series data more effectively.