As introduced, the standard self-attention mechanism, while powerful, carries a significant computational burden, especially as input sequences grow longer. Understanding this computational cost is fundamental to appreciating the motivation behind the more efficient architectures we will explore later in this chapter.
Let's analyze the core operations within the scaled dot-product attention formula, which forms the basis of self-attention in Transformers:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input sequence embeddings. Let $N$ be the sequence length, $d_k$ the dimension of the keys and queries, and $d_v$ the dimension of the values.
The primary computational steps are as follows (a minimal code sketch appears after the list):
Query-Key Similarity Calculation: Computing the matrix product $QK^T$. This multiplies an $(N \times d_k)$ matrix by a $(d_k \times N)$ matrix, producing an $(N \times N)$ score matrix at a cost of $O(N^2 d_k)$.
Scaling and Softmax: Scaling the scores by $1/\sqrt{d_k}$ and applying the softmax function row-wise, which costs $O(N^2)$.
Value Aggregation: Multiplying the softmax output (the attention weights matrix $A$, of dimensions $(N \times N)$) by the value matrix $V$, an $(N \times d_v)$ matrix, at a cost of $O(N^2 d_v)$.
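To make these three steps concrete, here is a minimal single-head sketch in NumPy. The sizes in the example ($N = 512$, $d_k = d_v = 64$) are illustrative assumptions, and the comments mirror the per-step shapes and costs listed above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention sketch: no batching, masking, or multi-head projections.

    Shapes: Q is (N, d_k), K is (N, d_k), V is (N, d_v).
    """
    d_k = Q.shape[-1]

    # Step 1: query-key similarity, (N, d_k) @ (d_k, N) -> (N, N), cost O(N^2 * d_k)
    scores = Q @ K.T

    # Step 2: scale by 1/sqrt(d_k) and apply a row-wise softmax, cost O(N^2)
    scores = scores / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 3: value aggregation, (N, N) @ (N, d_v) -> (N, d_v), cost O(N^2 * d_v)
    return weights @ V

# Example with assumed sizes: N = 512 tokens, d_k = d_v = 64
rng = np.random.default_rng(0)
N, d_k, d_v = 512, 64, 64
Q = rng.normal(size=(N, d_k))
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (512, 64)
```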
Combining these steps, the total computational complexity is dominated by the two large matrix multiplications: $O(N^2 d_k + N^2 d_v)$.
In many standard Transformer configurations, the dimensions $d_k$ and $d_v$ are proportional to the overall model embedding dimension $d_{\text{model}}$ (often $d_k = d_v = d_{\text{model}}/h$, where $h$ is the number of attention heads). Therefore, the complexity is commonly summarized as:
$$O(N^2 \cdot d_{\text{model}})$$

This quadratic dependence on the sequence length $N$ is the critical bottleneck. While the operations are highly parallelizable across the sequence dimension (unlike recurrent models), the total number of operations grows rapidly with $N$.
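A quick back-of-the-envelope estimate makes this growth tangible. Counting only the two large matrix multiplications (roughly $2N^2 d_{\text{model}}$ FLOPs each, summed across heads, and ignoring the linear projections and the softmax), doubling $N$ multiplies the cost by about four. The $d_{\text{model}} = 768$ figure below is an illustrative assumption:

```python
def attention_matmul_flops(n: int, d_model: int) -> int:
    # Rough count for the two large matmuls per layer (QK^T and AV),
    # summed over all heads, at ~2 FLOPs per multiply-accumulate.
    # Linear projections and the softmax are ignored.
    return 2 * n**2 * d_model + 2 * n**2 * d_model

d_model = 768  # assumed, roughly BERT-base sized
for n in (1024, 2048, 4096, 8192):
    gflops = attention_matmul_flops(n, d_model) / 1e9
    print(f"N={n:5d}: ~{gflops:7.1f} GFLOPs of attention matmuls per layer")
# Each doubling of N roughly quadruples the attention cost.
```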
Beyond computation, there's also a significant memory requirement. The intermediate attention score matrix $QK^T$ (before or after the softmax) has dimensions $(N \times N)$, so storing it requires $O(N^2)$ memory. For long sequences (e.g., $N > 4096$), storing an $(N \times N)$ matrix of floating-point numbers can exceed the memory capacity of typical accelerators such as GPUs, even before considering the memory needed for activations, gradients, and model parameters.
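To see what the $O(N^2)$ term means in bytes, the sketch below estimates the size of a single $(N \times N)$ score matrix at fp32 (4 bytes per element) and fp16 (2 bytes per element). In practice one such matrix is materialized per attention head and per layer, so the real footprint is considerably larger.

```python
def score_matrix_gib(n: int, bytes_per_element: int) -> float:
    # Memory for one (N x N) attention score matrix, in GiB.
    return n * n * bytes_per_element / 2**30

for n in (1024, 4096, 16384, 65536):
    print(f"N={n:6d}: fp32 {score_matrix_gib(n, 4):8.3f} GiB, "
          f"fp16 {score_matrix_gib(n, 2):8.3f} GiB")
# At N=65536, a single fp32 score matrix already needs 16 GiB,
# before activations, gradients, or model parameters are counted.
```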
The graph illustrates the rapid divergence in computational cost. As sequence length N increases, the O(N2) cost of standard self-attention quickly surpasses linear O(N) growth, making it impractical for very long sequences. The Y-axis uses a logarithmic scale to accommodate the large range of values.
This quadratic computational and memory complexity severely limits the application of standard Transformers to tasks involving very long sequences. Even with powerful hardware, sequence lengths beyond a few thousand tokens become challenging from both a runtime and memory perspective during training and inference. This limitation directly motivates the development of the alternative attention mechanisms and architectural variants discussed in the following sections, which aim to reduce this quadratic dependency.