While standard self-attention provides strong sequence modeling capabilities, its quadratic computational and memory complexity ($O(L^2)$ for sequence length $L$) remains a significant bottleneck, especially when dealing with long sequences encountered in areas like high-resolution image generation, document summarization, or genomic data analysis. The Performer architecture, introduced by Choromanski et al. (2020), offers an effective solution by approximating the attention mechanism with linear complexity. It achieves this through a clever technique called Fast Attention Via positive Orthogonal Random Features (FAVOR+).
Recall the standard scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
Here, $Q$, $K$, and $V$ are matrices of query, key, and value vectors, respectively, each with $L$ rows (sequence length) and $d$ (or $d_k$) columns (embedding dimension). The primary computational cost lies in computing the attention matrix $A = \text{softmax}\!\left(QK^\top/\sqrt{d}\right)$. This requires $O(L^2 d)$ operations and $O(L^2)$ memory, which quickly becomes prohibitive as $L$ increases.
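As a concrete reference point, here is a minimal NumPy sketch of this standard attention (the function name `standard_attention` and the example sizes are ours, not from the Performer paper). Note that it explicitly materializes the $L \times L$ score matrix that the rest of this section works to avoid.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Q, K, V: arrays of shape (L, d). Returns the (L, d) attention output."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (L, L): the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (L, d)

L, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = standard_attention(Q, K, V)                          # materializes a 1024 x 1024 matrix
```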
The core idea behind Performers is to avoid the explicit computation of the $L \times L$ matrix $QK^\top$. Instead, it seeks to approximate the attention mechanism using kernel methods. Let's rewrite the $i$-th output vector of the attention mechanism (before normalization) as:

$$\text{Att}_i = \sum_{j=1}^{L} \exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) v_j$$
The standard attention output is then obtained by normalizing this sum using the denominator $D_i = \sum_{j=1}^{L} \exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right)$.
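To make this per-query view concrete, here is a small illustrative NumPy function (our own naming) that computes the $i$-th output directly from the sum $\text{Att}_i$ and the denominator $D_i$ above; it reproduces the corresponding row of standard softmax attention.

```python
import numpy as np

def attention_row_sum_form(q_i, K, V):
    """q_i: (d,); K, V: (L, d). Returns the i-th attention output as a (d,) vector."""
    d = q_i.shape[-1]
    sims = np.exp(K @ q_i / np.sqrt(d))     # exp(q_i . k_j / sqrt(d)) for every j
    numerator = sims @ V                    # Att_i = sum_j exp(...) * v_j
    denominator = sims.sum()                # D_i   = sum_j exp(...)
    return numerator / denominator

L, d = 16, 8
rng = np.random.default_rng(1)
q_i = rng.standard_normal(d)
K, V = rng.standard_normal((L, d)), rng.standard_normal((L, d))
row = attention_row_sum_form(q_i, K, V)     # matches the i-th row of softmax attention
```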
Performers leverage the concept that the function $\exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right)$ resembles a kernel function, specifically a Gaussian kernel after appropriate scaling and transformation. Kernel methods often allow calculating similarities implicitly using feature maps. Performers propose finding a feature map $\phi: \mathbb{R}^d \to \mathbb{R}_+^r$, where $r$ is the dimension of the random features (typically $r \ll L$), such that the kernel similarity can be approximated by an inner product in the feature space:

$$\exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) \approx \phi(q_i)^\top \phi(k_j)$$
If such a feature map exists, the attention computation can be cleverly reordered to avoid the $L \times L$ calculation. The unnormalized attention sum becomes:

$$\text{Att}_i \approx \sum_{j=1}^{L} \left(\phi(q_i)^\top \phi(k_j)\right) v_j = \phi(q_i)^\top \left(\sum_{j=1}^{L} \phi(k_j)\, v_j^\top\right)$$
Similarly, the denominator can be approximated as:

$$D_i \approx \sum_{j=1}^{L} \phi(q_i)^\top \phi(k_j) = \phi(q_i)^\top \left(\sum_{j=1}^{L} \phi(k_j)\right)$$
Notice the important change in the order of operations. We can first compute the sums $\sum_{j} \phi(k_j)\, v_j^\top$ (an $r \times d$ matrix) and $\sum_{j} \phi(k_j)$ (an $r$-dimensional vector). These computations involve processing the keys and values once, taking roughly $O(Lrd)$ time. Then, for each query $q_i$, we compute $\phi(q_i)$ and perform two matrix-vector multiplications, taking $O(rd)$ time per query. The total time complexity becomes approximately $O(Lrd)$, which is linear in $L$ if $r$ is considered constant or grows much slower than $L$. The memory complexity also reduces to $O(Lr + Ld + rd)$, primarily for storing the mapped queries, keys, values, and intermediate sums.
Comparison of computational flow. Standard attention requires forming the expensive $L \times L$ matrix (red nodes). Performer uses feature maps $\phi$ to compute intermediate sums (yellow nodes) involving linear complexity operations in $L$, avoiding the quadratic bottleneck.
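The reordering above translates almost line for line into code. The sketch below assumes some feature map `phi` mapping $(L, d)$ inputs to $(L, r)$ features is already available (the names `linear_attention` and `phi` are ours); it never forms an $L \times L$ matrix and can be paired with any suitable feature map, such as the FAVOR+ features described next.

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    """Approximate softmax attention in O(L * r * d) time and memory.

    Q, K, V: (L, d) arrays; phi: callable mapping an (L, d) array to (L, r) features.
    """
    Qf = phi(Q)                            # (L, r) mapped queries
    Kf = phi(K)                            # (L, r) mapped keys
    KV = Kf.T @ V                          # (r, d): sum_j phi(k_j) v_j^T, computed once
    Ksum = Kf.sum(axis=0)                  # (r,):   sum_j phi(k_j)
    numerator = Qf @ KV                    # (L, d): phi(q_i)^T (sum_j phi(k_j) v_j^T)
    denominator = Qf @ Ksum                # (L,):   phi(q_i)^T (sum_j phi(k_j))
    return numerator / (denominator[:, None] + 1e-8)   # guard against tiny denominators
```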
The effectiveness of Performers depends on finding a suitable feature map $\phi$. The FAVOR+ mechanism achieves this by constructing $\phi$ using random features, drawing inspiration from techniques used to approximate Gaussian kernels (like Random Fourier Features). An important contribution of Performers is the development of feature maps that guarantee positivity ($\phi(x) > 0$ componentwise), which helps maintain stability and better approximates the properties of the softmax function (whose output components are non-negative).
Specifically, Performers define feature maps based on random projections and trigonometric functions, for example:

$$\phi(x) = \frac{h(x)}{\sqrt{m}} \left[f_1(\omega_1^\top x), \ldots, f_1(\omega_m^\top x), \ldots, f_l(\omega_1^\top x), \ldots, f_l(\omega_m^\top x)\right]$$

where $\omega_1, \ldots, \omega_m$ are randomly sampled vectors (e.g., drawn from $\mathcal{N}(0, I_d)$), $h(x)$ is a scalar function like $\exp(-\|x\|^2/2)$, and $f_1, \ldots, f_l$ are pairs like $(\sin, \cos)$ or $(\exp(u), \exp(-u))$. The dimension of the feature map is $r = l \cdot m$. The exact construction ensures unbiased or near-unbiased estimation of the softmax kernel with positive outputs.
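As an illustration, here is one common positive random feature construction in NumPy, roughly following the FAVOR+ recipe with $h(x) = \exp(-\|x\|^2/2)$ and $f(u) = \exp(u)$. The helper names and the small numerical check are ours; the $\omega_i$ are drawn i.i.d. Gaussian here, so the orthogonalization step of FAVOR+ (which further reduces variance) is omitted, and inputs are rescaled by $d^{1/4}$ to fold in the $1/\sqrt{d}$ temperature of softmax attention.

```python
import numpy as np

def make_projection(d, m, seed=0):
    """Draw m random projection vectors omega_i ~ N(0, I_d), stacked as an (m, d) array."""
    return np.random.default_rng(seed).standard_normal((m, d))

def positive_random_features(X, W):
    """phi(x) = exp(-||x||^2 / 2) / sqrt(m) * exp(W x), applied row-wise to X of shape (L, d)."""
    m = W.shape[0]
    X = X / X.shape[-1] ** 0.25                            # absorb the 1/sqrt(d) scaling
    norm = np.exp(-0.5 * np.sum(X ** 2, axis=-1, keepdims=True))
    return norm * np.exp(X @ W.T) / np.sqrt(m)             # (L, m), strictly positive features

# Check: phi(q)^T phi(k) should approximate exp(q . k / sqrt(d)).
L, d, m = 8, 16, 4096
rng = np.random.default_rng(1)
Q = 0.25 * rng.standard_normal((L, d))
K = 0.25 * rng.standard_normal((L, d))
W = make_projection(d, m)
approx = positive_random_features(Q, W) @ positive_random_features(K, W).T
exact = np.exp(Q @ K.T / np.sqrt(d))
# Max relative error: typically a few percent for modest-norm inputs, shrinking as m grows.
print(np.max(np.abs(approx - exact) / exact))
```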
Advantages:

- Time and memory cost that is linear in the sequence length $L$, instead of quadratic.
- An unbiased (or near-unbiased) approximation of the softmax kernel, rather than a structurally different attention pattern.
- Positive random features, which improve numerical stability and mirror the non-negativity of softmax weights.

Considerations:

- The output is an approximation of softmax attention; its accuracy depends on the number of random features $r$.
- The speedup relies on $r$ (and $d$) being much smaller than $L$, since the cost scales as $O(Lrd)$.
By approximating the softmax kernel using positive random features, Performers provide a computationally efficient and theoretically grounded method to scale Transformer models to lengths previously considered infeasible, significantly broadening their applicability.