While standard self-attention provides powerful sequence modeling capabilities, its quadratic computational and memory complexity ($O(N^2)$ for sequence length $N$) remains a significant bottleneck, especially when dealing with long sequences encountered in areas like high-resolution image generation, document summarization, or genomic data analysis. The Performer architecture, introduced by Choromanski et al. (2020), offers an effective solution by approximating the attention mechanism with linear complexity. It achieves this through a technique called Fast Attention Via positive Orthogonal Random features (FAVOR+).
Recall the standard scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, $Q$, $K$, $V$ are matrices of query, key, and value vectors, respectively, each with $N$ rows (sequence length) and $d_k$ (or $d_v$) columns (embedding dimension). The primary computational cost lies in computing the $N \times N$ attention matrix $A = \text{softmax}(QK^T/\sqrt{d_k})$. This requires $O(N^2 d_k)$ operations and $O(N^2)$ memory, which quickly becomes prohibitive as $N$ increases.
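To make the cost concrete, here is a minimal NumPy sketch of this quadratic baseline. The function name `softmax_attention` and the shapes are illustrative; the intermediate `scores` array is exactly the $N \times N$ matrix that dominates both time and memory.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K: (N, d_k); V: (N, d_v). The scores array below is the full
    N x N attention matrix: O(N^2 d_k) time and O(N^2) memory.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N): the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)              # row-wise softmax
    return A @ V                                    # (N, d_v)
```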
The core idea behind Performers is to avoid the explicit computation of the $N \times N$ matrix $A$. Instead, they approximate the attention mechanism using kernel methods. Let's rewrite the $i$-th output vector of the attention mechanism (before normalization) as:
$$\text{Att}'(Q, K, V)_i = \sum_{j=1}^{N} \exp\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) v_j$$

The standard attention output is then obtained by normalizing this sum using the denominator $D_i = \sum_{j=1}^{N} \exp(q_i^T k_j / \sqrt{d_k})$.
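The sketch below mirrors these two formulas for a single index $i$, making the unnormalized sum and the denominator explicit; the helper name `attention_row` is purely illustrative.

```python
import numpy as np

def attention_row(Q, K, V, i):
    """i-th attention output written as explicit sums over j:
    Att'(Q,K,V)_i = sum_j exp(q_i^T k_j / sqrt(d_k)) v_j, normalized by D_i."""
    d_k = Q.shape[-1]
    weights = np.exp(Q[i] @ K.T / np.sqrt(d_k))    # exp(q_i^T k_j / sqrt(d_k)) for every j
    att_prime = weights @ V                        # unnormalized sum over j
    D_i = weights.sum()                            # denominator D_i
    return att_prime / D_i                         # normalized i-th output row
```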
Performers leverage the concept that the function $\text{sim}(q_i, k_j) = \exp(q_i^T k_j / \sqrt{d_k})$ resembles a kernel function, specifically a Gaussian kernel after appropriate scaling and transformation. Kernel methods often allow calculating similarities implicitly using feature maps. Performers propose finding a feature map $\phi: \mathbb{R}^{d_k} \rightarrow \mathbb{R}^r$, where $r$ is the dimension of the random features (typically $r \ll N$), such that the kernel similarity can be approximated by an inner product in the feature space:
$$\exp\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) \approx \phi(q_i)^T \phi(k_j)$$

If such a feature map $\phi$ exists, the attention computation can be cleverly reordered to avoid the $N \times N$ calculation. The unnormalized attention sum becomes:
$$\text{Att}'(Q, K, V)_i \approx \sum_{j=1}^{N} \left(\phi(q_i)^T \phi(k_j)\right) v_j = \phi(q_i)^T \sum_{j=1}^{N} \left(\phi(k_j) v_j^T\right)$$

Similarly, the denominator can be approximated as:
$$D_i \approx \sum_{j=1}^{N} \phi(q_i)^T \phi(k_j) = \phi(q_i)^T \sum_{j=1}^{N} \phi(k_j)$$

Notice the crucial change in the order of operations. We can first compute the sums $\sum_{j=1}^{N} \phi(k_j) v_j^T$ (an $r \times d_v$ matrix) and $\sum_{j=1}^{N} \phi(k_j)$ (an $r \times 1$ vector). These computations involve processing the keys and values once, taking roughly $O(N r d_k + N r d_v)$ time. Then, for each query $q_i$, we compute $\phi(q_i)$ and perform two matrix-vector multiplications, taking $O(r d_k + r d_v)$ time per query. The total time complexity becomes approximately $O(N r (d_k + d_v))$, which is linear in $N$ if $r$ is considered constant or grows much slower than $N$. The memory complexity also reduces to $O(N r + N d_k + N d_v)$, primarily for storing the mapped queries, keys, values, and intermediate sums.
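A minimal sketch of this reordering, assuming some feature map `phi` with the approximation property above is available. The name `linear_attention` and the small constant added to the denominator for numerical safety are illustrative choices, not part of the original formulation.

```python
import numpy as np

def linear_attention(Q, K, V, phi, eps=1e-6):
    """Performer-style attention that never materializes the N x N matrix.

    phi maps an (N, d_k) array to (N, r) feature representations, so the
    total cost is roughly O(N r (d_k + d_v)) time and O(N r) extra memory.
    """
    Q_f = phi(Q)                       # (N, r)
    K_f = phi(K)                       # (N, r)
    KV = K_f.T @ V                     # (r, d_v): sum_j phi(k_j) v_j^T
    K_sum = K_f.sum(axis=0)            # (r,):     sum_j phi(k_j)
    numerator = Q_f @ KV               # (N, d_v): phi(q_i)^T (sum_j phi(k_j) v_j^T)
    denominator = Q_f @ K_sum + eps    # (N,):     phi(q_i)^T (sum_j phi(k_j))
    return numerator / denominator[:, None]
```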
Comparison of computational flow. Standard attention requires forming the expensive $N \times N$ matrix (red nodes). Performer uses feature maps $\phi$ to compute intermediate sums (yellow nodes) with operations that are linear in $N$, avoiding the quadratic bottleneck.
The effectiveness of Performers hinges on finding a suitable feature map $\phi$. The FAVOR+ mechanism achieves this by constructing $\phi$ using random features, drawing inspiration from techniques used to approximate Gaussian kernels (like Random Fourier Features). A key contribution of Performers is the development of feature maps that guarantee positivity ($\phi(x)^T \phi(y) \geq 0$), which helps maintain stability and better approximates the properties of the softmax function (whose output components are non-negative).
Specifically, Performers define feature maps based on random projections passed through simple non-linearities (trigonometric or exponential functions), for example:
$$\phi(x) = \frac{h(x)}{\sqrt{m}} \left[ f_1(w_1^T x), \ldots, f_m(w_m^T x), f_1'(w_1^T x), \ldots, f_m'(w_m^T x) \right]^T$$

where $w_i$ are randomly sampled vectors, $h(x)$ is a function like $\exp(\|x\|^2/2)$, and $f_l, f_l'$ are pairs like $(\sin, \cos)$ or $(\exp, \exp)$. The dimension of the feature map is $r = 2m$. The exact construction ensures unbiased or near-unbiased estimation of the softmax kernel with positive outputs.
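As an illustration, the sketch below implements the simplest positive variant, $\phi(x) = \exp(w^T x - \|x\|^2/2)/\sqrt{m}$, after folding the $1/\sqrt{d_k}$ temperature into the inputs and using i.i.d. Gaussian projections. Full FAVOR+ additionally orthogonalizes the random vectors, which this simplified version omits; the helper name `positive_random_features` and the test sizes are arbitrary.

```python
import numpy as np

def positive_random_features(X, W, d_k):
    """Positive random features for the softmax kernel (simplified FAVOR+).

    X: (N, d_k) queries or keys; W: (m, d_k) with rows drawn i.i.d. from N(0, I).
    In expectation over W, phi(q) . phi(k) = exp(q^T k / sqrt(d_k)).
    """
    X = X / d_k ** 0.25                                    # fold 1/sqrt(d_k) into the inputs
    proj = X @ W.T                                         # (N, m): projections w_i^T x
    sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True) # ||x||^2 / 2 term
    return np.exp(proj - sq_norm) / np.sqrt(W.shape[0])    # strictly positive features

# Illustrative check: the approximation error shrinks as m grows.
rng = np.random.default_rng(0)
N, d_k, m = 8, 16, 1024
Q = 0.5 * rng.normal(size=(N, d_k))
K = 0.5 * rng.normal(size=(N, d_k))
W = rng.normal(size=(m, d_k))
exact = np.exp(Q @ K.T / np.sqrt(d_k))
approx = positive_random_features(Q, W, d_k) @ positive_random_features(K, W, d_k).T
print(np.max(np.abs(exact - approx) / exact))   # relative error, decreases with larger m
```

Plugging this map into the `linear_attention` sketch above, e.g. `linear_attention(Q, K, V, lambda X: positive_random_features(X, W, d_k))`, yields a complete linear-time approximation of softmax attention under these assumptions.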
Advantages:

- Attention cost scales linearly with sequence length $N$ in both time and memory, instead of quadratically.
- The random feature construction gives an unbiased (or near-unbiased) estimate of the softmax kernel.
- Positive features ($\phi(x)^T \phi(y) \geq 0$) keep the approximation numerically stable and consistent with the non-negative weights produced by softmax.

Considerations:

- Approximation quality depends on the number of random features $r$; smaller $r$ lowers the cost but increases the variance of the estimate.
- The output is a stochastic approximation of exact softmax attention, so results can differ slightly between random feature draws.
By approximating the softmax kernel using positive random features, Performers provide a computationally efficient and theoretically grounded method to scale Transformer models to lengths previously considered infeasible, significantly broadening their applicability.