The quadratic complexity of standard self-attention, $O(N^2 d)$, where $N$ is the sequence length and $d$ is the model dimension, represents a significant bottleneck for processing long sequences. While powerful, the mechanism requires computing attention scores between every pair of tokens, which becomes computationally prohibitive as $N$ increases.
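To make this bottleneck concrete, here is a minimal sketch of standard scaled dot-product attention in PyTorch (the framework choice, function name, and dimensions are illustrative assumptions, not code from any particular implementation). The point to notice is that the $N \times N$ score matrix is materialized explicitly, which is what drives both the compute and memory cost.

```python
import math
import torch

def standard_attention(Q, K, V):
    """Standard scaled dot-product attention (illustrative sketch).

    Q, K: (N, d_k); V: (N, d_v).
    The intermediate score matrix has shape (N, N), so both the matmul
    cost and the memory footprint grow quadratically with N.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (N, N)
    P = torch.softmax(scores, dim=-1)                  # (N, N) attention weights
    return P @ V                                       # (N, d_v)

# Example: N = 4096 tokens already implies a 4096 x 4096 score matrix.
N, d_k, d_v = 4096, 64, 64
Q, K, V = torch.randn(N, d_k), torch.randn(N, d_k), torch.randn(N, d_v)
out = standard_attention(Q, K, V)  # (4096, 64)
```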
Several approaches aim to alleviate this computational burden. One prominent direction involves approximating the attention mechanism using low-rank projections, as exemplified by the Linformer model.
The Low-Rank Hypothesis
The core idea behind Linformer is based on the hypothesis that the self-attention mechanism, despite its potential to model complex interactions across the entire sequence, can often be approximated by a low-rank matrix. In essence, the $N \times N$ attention matrix $P = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ might be structurally redundant. This means the mapping from input context (represented by the Values, $V$) to output context (the weighted sum $PV$) doesn't necessarily require the full expressive power of an arbitrary $N \times N$ matrix. If this hypothesis holds, we can potentially achieve significant efficiency gains by working with a compressed representation.
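One simple way to probe this hypothesis is to look at the singular values of an attention matrix and ask how much of its energy a rank-$k$ approximation retains. The snippet below is a toy illustration on random $Q$ and $K$ (not the analysis from the Linformer paper, which makes the corresponding argument for trained attention matrices); the helper name and sizes are assumptions for the example.

```python
import torch

def spectral_energy(P, k):
    """Fraction of the squared Frobenius norm of P captured by its top-k
    singular values, i.e. how well a rank-k matrix can approximate P."""
    s = torch.linalg.svdvals(P)            # singular values, descending
    return (s[:k] ** 2).sum() / (s ** 2).sum()

N, d_k, k = 512, 64, 128
Q, K = torch.randn(N, d_k), torch.randn(N, d_k)
P = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (N, N) attention matrix

print(f"Energy captured by a rank-{k} approximation: {spectral_energy(P, k):.3f}")
```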
Linformer: Projecting Keys and Values
Linformer (a portmanteau of "linear" and "Transformer") proposes a clever way to achieve linear complexity by introducing learned projection matrices $E$ and $F$. Instead of computing the full $N \times N$ attention matrix, Linformer projects the Key ($K$) and Value ($V$) matrices along the sequence-length dimension before the attention computation.
Let the input sequence length be $N$ and the head dimension be $d_k$ (for Queries/Keys) or $d_v$ (for Values). The original matrices are:
Query $Q \in \mathbb{R}^{N \times d_k}$
Key $K \in \mathbb{R}^{N \times d_k}$
Value $V \in \mathbb{R}^{N \times d_v}$
Linformer introduces two projection matrices, $E \in \mathbb{R}^{k \times N}$ and $F \in \mathbb{R}^{k \times N}$, where $k$ is a projected dimension significantly smaller than $N$ ($k \ll N$). These matrices are used to create projected Key and Value matrices:
$$K_{\text{proj}} = E K \in \mathbb{R}^{k \times d_k}, \qquad V_{\text{proj}} = F V \in \mathbb{R}^{k \times d_v}$$
Notice how the projection reduces the sequence dimension from $N$ to $k$. The key change is that the attention scores are now computed between the original Query matrix $Q$ and the projected Key matrix $K_{\text{proj}}$:
$$P_{\text{proj}} = \mathrm{softmax}\!\left(\frac{Q K_{\text{proj}}^\top}{\sqrt{d_k}}\right) \in \mathbb{R}^{N \times k}$$
The final output is then obtained by multiplying this projected attention score matrix $P_{\text{proj}}$ with the projected Value matrix $V_{\text{proj}}$:
$$\text{Output} = P_{\text{proj}} V_{\text{proj}} \in \mathbb{R}^{N \times d_v}$$
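Putting these equations together, a minimal single-head sketch in PyTorch might look like the following. The module name, initialization, and hyperparameters are illustrative assumptions rather than the authors' reference code; $E$ and $F$ are stored as plain learned parameters of shape $(k, N)$, so the module assumes a fixed maximum sequence length.

```python
import math
import torch
import torch.nn as nn

class LinformerAttentionHead(nn.Module):
    """Single-head Linformer-style attention for sequences of fixed length N.

    Keys and Values are projected along the sequence dimension from N down
    to k before the scores are computed, so the score matrix is (N, k)
    rather than (N, N).
    """
    def __init__(self, seq_len, d_model, d_k, d_v, k):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)
        # Learned projections E, F in R^{k x N} along the sequence axis.
        self.E = nn.Parameter(torch.randn(k, seq_len) / math.sqrt(seq_len))
        self.F = nn.Parameter(torch.randn(k, seq_len) / math.sqrt(seq_len))
        self.d_k = d_k

    def forward(self, x):                    # x: (batch, N, d_model)
        Q = self.W_q(x)                      # (batch, N, d_k)
        K_proj = self.E @ self.W_k(x)        # (k, N) @ (batch, N, d_k) -> (batch, k, d_k)
        V_proj = self.F @ self.W_v(x)        # (batch, k, d_v)
        scores = Q @ K_proj.transpose(-2, -1) / math.sqrt(self.d_k)  # (batch, N, k)
        P_proj = torch.softmax(scores, dim=-1)
        return P_proj @ V_proj               # (batch, N, d_v)

head = LinformerAttentionHead(seq_len=4096, d_model=512, d_k=64, d_v=64, k=256)
out = head(torch.randn(2, 4096, 512))        # (2, 4096, 64)
```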
Let's analyze the computational complexity. The original attention calculation is dominated by the $QK^\top$ matrix multiplication, which takes $O(N^2 d_k)$ time, and the subsequent multiplication by $V$, which takes $O(N^2 d_v)$ time.
In Linformer:
Projecting K: $EK$ takes $O(N k d_k)$ time.
Projecting V: $FV$ takes $O(N k d_v)$ time.
Computing $Q K_{\text{proj}}^\top$: this multiplies an $N \times d_k$ matrix by a $d_k \times k$ matrix, taking $O(N k d_k)$ time.
Computing $P_{\text{proj}} V_{\text{proj}}$: this multiplies an $N \times k$ matrix by a $k \times d_v$ matrix, taking $O(N k d_v)$ time.
Since $k$ is chosen such that $k \ll N$ and is fixed independently of $N$, the overall complexity of Linformer's attention mechanism becomes $O(Nk)$, which is linear with respect to the sequence length $N$. This is a substantial improvement over the standard $O(N^2)$ complexity.
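As a rough back-of-the-envelope check of this claim (the numbers below are chosen purely for illustration), the dominant multiply-add counts of the two pipelines can be compared directly:

```python
# Illustrative operation counts for one attention head.
N, d_k, d_v, k = 16_384, 64, 64, 256

standard = N * N * d_k + N * N * d_v            # QK^T and PV
linformer = 2 * N * k * d_k + 2 * N * k * d_v   # EK, FV, QK_proj^T, P_proj V_proj

print(f"standard : {standard:,} multiply-adds")   # ~34.4 billion
print(f"linformer: {linformer:,} multiply-adds")  # ~1.07 billion
print(f"speedup  : {standard / linformer:.0f}x")  # 32x for these sizes
```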
Comparison of computational flow in standard self-attention versus Linformer's projected attention. Linformer introduces projection matrices ($E$, $F$) to reduce the dimensionality of Keys and Values along the sequence-length axis before calculating attention scores.
Implementation
Projection Matrices: The projection matrices $E$ and $F$ are typically learned during training. They can be shared across different attention heads, or even across layers, to reduce the parameter count further. A common implementation uses simple linear layers acting on the transposed Key and Value matrices ($K^\top$, $V^\top$) to perform the projection along the sequence axis efficiently (a sketch of this pattern follows this list).
Choice of $k$: The projected dimension $k$ is a hyperparameter. Values like 128, 256, or 512 are often used, significantly smaller than the typical sequence lengths (thousands or tens of thousands of tokens) at which Linformer becomes advantageous. The choice governs the trade-off between computational efficiency and model accuracy: a smaller $k$ is faster but may incur a larger approximation error.
Theoretical Guarantees: The Linformer paper provides theoretical analysis showing that, under certain assumptions, the self-attention matrix is approximately low-rank and can be well approximated by this projection method.
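A hedged sketch of the implementation pattern mentioned above, assuming a fixed sequence length and using PyTorch's nn.Linear over the transposed sequence axis; the class name and the choice to share a single projection for both $K$ and $V$ (i.e. $E = F$) are illustrative assumptions, not a prescribed configuration.

```python
import torch
import torch.nn as nn

class SharedSeqProjection(nn.Module):
    """Projects K and V from (batch, N, d) to (batch, k, d) using one
    shared linear layer over the sequence axis (the E = F sharing option)."""
    def __init__(self, seq_len, k):
        super().__init__()
        # nn.Linear acts on the last dim, so we transpose the sequence axis there.
        self.proj = nn.Linear(seq_len, k, bias=False)

    def forward(self, K, V):  # K: (batch, N, d_k), V: (batch, N, d_v)
        K_proj = self.proj(K.transpose(1, 2)).transpose(1, 2)  # (batch, k, d_k)
        V_proj = self.proj(V.transpose(1, 2)).transpose(1, 2)  # (batch, k, d_v)
        return K_proj, V_proj

proj = SharedSeqProjection(seq_len=4096, k=256)
K, V = torch.randn(2, 4096, 64), torch.randn(2, 4096, 64)
K_proj, V_proj = proj(K, V)  # (2, 256, 64), (2, 256, 64)
```

Sharing one projection halves the extra parameters at the cost of tying the Key and Value compressions together; whether that matters in practice is task-dependent.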
Advantages and Disadvantages
Advantages:
Linear Time Complexity: Reduces attention time complexity from $O(N^2)$ to $O(Nk)$.
Linear Space Complexity: Reduces memory usage from $O(N^2)$ to $O(Nk)$, since only an $N \times k$ score matrix is stored.
Scalability: Enables processing much longer sequences than standard Transformers.
Simple Modification: Relatively straightforward to implement by adding projection layers.
Disadvantages:
Approximation Error: As an approximation, it may not capture all the nuances of the full attention matrix, potentially leading to a performance decrease on certain tasks compared to standard attention (especially for shorter sequences, where $N^2$ is manageable).
Hyperparameter Tuning: Requires tuning the projection dimension $k$.
Low-Rank Assumption: Its effectiveness relies on the underlying assumption that the attention matrix can be well approximated by a low-rank structure, which might not hold equally well for all data or tasks.
Linformer represents a significant step towards building more efficient Transformer models. By challenging the necessity of the full quadratic attention computation and leveraging the concept of low-rank approximation, it offers a practical way to scale Transformers to longer sequences, opening up possibilities for applications involving extensive documents, high-resolution images, or long-form audio.
Linformer: Self-Attention with Linear Complexity. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma, 2020. arXiv:2006.04768 (DOI: 10.48550/arXiv.2006.04768). Introduces the Linformer model, which achieves linear complexity in self-attention by projecting keys and values to a lower dimension.
Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh, 2021. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35 (DOI: 10.48550/arXiv.2102.03902). Presents an alternative efficient Transformer that uses the Nyström method for low-rank approximation of attention matrices.
Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler, 2022. ACM Computing Surveys (ACM) (DOI: 10.48550/arXiv.2009.06732). Offers an overview and taxonomy of various efficient Transformer architectures, including Linformer and other linear attention models.
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (Curran Associates, Inc.) (DOI: 10.48550/arXiv.1706.03762). The foundational paper that introduced the Transformer architecture and the self-attention mechanism, highlighting the quadratic-complexity bottleneck that Linformer addresses.