As discussed earlier in this chapter, the quadratic complexity of standard self-attention, $O(N^2 d)$, where $N$ is the sequence length and $d$ is the model dimension, represents a significant bottleneck for processing long sequences. Calculating attention scores between every pair of tokens, while powerful, becomes computationally prohibitive as $N$ increases.
Several approaches aim to alleviate this computational burden. One prominent direction involves approximating the attention mechanism using low-rank projections, as exemplified by the Linformer model.
The core idea behind Linformer is based on the hypothesis that the self-attention mechanism, despite its potential to model complex interactions across the entire sequence, can often be approximated by a low-rank matrix. In essence, the $N \times N$ attention matrix $P = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$ might be structurally redundant. This means the mapping from input context (represented by the Values, $V$) to output context (the weighted sum $PV$) doesn't necessarily require the full expressive power of an arbitrary $N \times N$ matrix. If this hypothesis holds, we can potentially achieve significant efficiency gains by working with a compressed representation.
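To make the low-rank intuition concrete, the short sketch below is a numerical illustration only: it builds an attention matrix from random Queries and Keys with arbitrary sizes and inspects how quickly its singular values decay. The sizes and random inputs are assumptions for this example, not an experiment from the Linformer paper.

```python
import torch

# Illustrative sizes: a sequence of N tokens with head dimension d_k << N.
torch.manual_seed(0)
N, d_k = 512, 64

Q = torch.randn(N, d_k)
K = torch.randn(N, d_k)
P = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # N x N attention matrix

# Singular values of P (descending) and the cumulative spectral energy.
s = torch.linalg.svdvals(P)
energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)

# Fraction of the spectral energy captured by the top-r singular values.
for r in (16, 64, 128):
    print(f"top {r:>3} singular values capture {energy[r - 1].item():.3f} of the energy")
```

If most of the energy is concentrated in a small number of singular values, a rank-$k$ approximation with $k \ll N$ can reproduce the matrix closely, which is exactly the situation Linformer exploits.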
Linformer proposes a straightforward way to achieve linear complexity by introducing learned projection matrices $E$ and $F$ (one pair per attention head). Instead of computing the full $N \times N$ attention matrix, Linformer projects the Key ($K$) and Value ($V$) matrices along the sequence length dimension before the attention computation.
Let the input sequence length be $N$ and the head dimension be $d_k$ (for Keys/Queries) or $d_v$ (for Values). The original matrices are $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{N \times d_k}$, and $V \in \mathbb{R}^{N \times d_v}$.
Linformer introduces two projection matrices, $E \in \mathbb{R}^{k \times N}$ and $F \in \mathbb{R}^{k \times N}$, where $k$ is a projected dimension significantly smaller than $N$ ($k \ll N$). These matrices are used to create projected Key and Value matrices:
$$K_{\text{proj}} = EK \quad (\text{dimension } k \times d_k), \qquad V_{\text{proj}} = FV \quad (\text{dimension } k \times d_v)$$

Notice how the projection reduces the sequence dimension from $N$ to $k$. The crucial step is that the attention scores are now computed between the original Query matrix $Q$ and the projected Key matrix $K_{\text{proj}}$:
$$P_{\text{proj}} = \text{Softmax}\!\left(\frac{Q K_{\text{proj}}^T}{\sqrt{d_k}}\right) \quad (\text{dimension } N \times k)$$

The final output is then obtained by multiplying this projected attention score matrix $P_{\text{proj}}$ with the projected Value matrix $V_{\text{proj}}$:
$$\text{Attention}_{\text{Linformer}}(Q, K, V) = P_{\text{proj}} V_{\text{proj}} \quad (\text{dimension } N \times d_v)$$

Let's analyze the computational complexity. The original attention calculation is dominated by the $QK^T$ matrix multiplication, which takes $O(N^2 d_k)$ time, and the subsequent multiplication by $V$, which takes $O(N^2 d_v)$ time.
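The following PyTorch sketch puts these steps together for a single attention head. It is a minimal illustration under stated assumptions: the class name `LinformerSelfAttention`, the simple random initialization of $E$ and $F$, and the fixed input length `n` are choices made for this example, not details prescribed by the paper.

```python
import torch
import torch.nn as nn


class LinformerSelfAttention(nn.Module):
    """Single-head Linformer-style attention sketch with a fixed sequence length n."""

    def __init__(self, n: int, d_model: int, k: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # E and F compress the sequence dimension from n down to k.
        # The plain random initialization is an assumption for this sketch.
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        Q = self.q_proj(x)                           # (batch, n, d)
        K = self.k_proj(x)                           # (batch, n, d)
        V = self.v_proj(x)                           # (batch, n, d)

        K_proj = self.E @ K                          # (batch, k, d)
        V_proj = self.F @ V                          # (batch, k, d)

        scores = Q @ K_proj.transpose(-2, -1)        # (batch, n, k)
        P_proj = torch.softmax(scores / self.d_model ** 0.5, dim=-1)
        return P_proj @ V_proj                       # (batch, n, d)


# Usage with illustrative sizes: n = 4096 tokens compressed to k = 256.
x = torch.randn(2, 4096, 64)
attn = LinformerSelfAttention(n=4096, d_model=64, k=256)
print(attn(x).shape)                                 # torch.Size([2, 4096, 64])
```

Note that $E$ and $F$ have a first dimension tied to the sequence length `n`, so the maximum input length must be fixed when the module is constructed; this point comes up again in the disadvantages below.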
In Linformer, the dominant costs are: forming the projections $EK$ and $FV$, each $O(Nkd)$; computing $QK_{\text{proj}}^T$, which is $O(Nkd_k)$; and computing $P_{\text{proj}}V_{\text{proj}}$, which is $O(Nkd_v)$.
Since $k$ is chosen such that $k \ll N$, the overall complexity of Linformer's attention mechanism becomes $O(Nk)$, which is linear with respect to the sequence length $N$. This is a substantial improvement over the standard $O(N^2)$ complexity.
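As a rough back-of-the-envelope check (with illustrative sizes, not benchmark measurements), the snippet below counts the dominant multiply-add operations for both variants.

```python
# Assumed sizes for illustration: N = 4096 tokens, d = 64, projected length k = 256.
N, d, k = 4096, 64, 256

standard = 2 * N * N * d      # Q K^T plus the multiplication by V
linformer = 2 * N * k * d     # Q K_proj^T plus the multiplication by V_proj
linformer += 2 * k * N * d    # forming K_proj = EK and V_proj = FV

print(f"standard : {standard:,} multiply-adds")
print(f"linformer: {linformer:,} multiply-adds")
print(f"speedup  : ~{standard / linformer:.1f}x")
```

With these sizes the dominant-term ratio works out to roughly $N / 2k = 8\times$, and the gap widens as $N$ grows while $k$ stays fixed.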
Comparison of computational flow in standard self-attention versus Linformer's projected attention. Linformer introduces projection matrices (E, F) to reduce the dimensionality of Keys and Values along the sequence length axis before calculating attention scores.
Advantages:

- Linear time and memory complexity, $O(Nk)$, with respect to the sequence length $N$.
- The projection matrices can be shared across heads and layers, keeping the parameter overhead small.
- Reported to match the quality of standard attention on many benchmarks while running substantially faster on long sequences.
Disadvantages:

- The projection matrices have a fixed shape $k \times N$, so the maximum sequence length must be chosen in advance and shorter inputs padded.
- The quality of the low-rank approximation depends on $k$; attention patterns that are genuinely high-rank may lose fine-grained token-to-token interactions.
- Causal (autoregressive) decoding is awkward, because projecting along the sequence dimension mixes information from future positions.
Linformer represents a significant step towards building more efficient Transformer models. By challenging the necessity of the full quadratic attention computation and leveraging the concept of low-rank approximation, it offers a practical way to scale Transformers to longer sequences, opening up possibilities for applications involving extensive documents, high-resolution images, or long-form audio.