The self-attention mechanism, in which queries, keys, and values (Q, K, V) all originate from the same input sequence, lets a model weigh the importance of each token relative to every other. However, relying on a single attention calculation per position imposes significant constraints.
Think about the output of the scaled dot-product attention function for a specific query (representing a token). It's a weighted sum of the value vectors, where the weights are determined by the compatibility between the query and all keys. This process yields a single context vector for each position in the sequence.
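For reference, this computation is conventionally written as

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Each row of the softmax term is one token's weight distribution over all keys, and multiplying by V collapses that distribution into the single context vector described above.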
The primary limitation arises because this single attention mechanism must learn to encode multiple types of relationships and features using only one set of attention weights per token. Consider what kinds of information might be relevant for a given token: its syntactic role (for example, which verb it serves as the subject of, or which word modifies it), its semantic similarity to other tokens elsewhere in the sequence, and longer-range links such as the clause or phrase it belongs to.
A single attention head is forced to average these potentially diverse relational signals into one representation. For instance, attending strongly to a syntactically linked verb might require down-weighting attention to a semantically similar but syntactically distant noun. This averaging effect can create an information bottleneck, preventing the model from capturing distinct, fine-grained patterns simultaneously. If the model learns an "average" attention pattern that tries to accommodate all of these needs, it may not capture any specific one particularly well.
Furthermore, the initial linear projections that transform the input embeddings into the single query, key, and value space might also be restrictive. The model learns one set of projection matrices (W_Q, W_K, W_V), and this single transformation might project the input into a subspace that highlights certain features while obscuring others. It limits the model's capacity to explore different representational subspaces of the input embeddings, where different kinds of relationships might be more apparent.
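To make this concrete, here is a minimal NumPy sketch of a single attention head. The shapes, variable names, and random inputs are illustrative assumptions for this section, not the code of any particular framework.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention with ONE set of projections.

    X: (seq_len, d_model) input embeddings.
    W_Q, W_K, W_V: (d_model, d_k) projection matrices, i.e. a single
    learned subspace that must serve every type of relationship.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one weight distribution per token
    return weights @ V                         # one context vector per token

# Illustrative shapes only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 7, 16, 16
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = single_head_attention(X, W_Q, W_K, W_V)  # shape (7, 16)
```

Everything the model wants to express about a token's relationships must pass through the single `weights` matrix and the single subspace defined by `W_Q`, `W_K`, and `W_V`.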
Imagine processing the sentence: "The model architecture, which relies on attention, performs well." For the token "architecture," a single attention head might need to determine, at the same time, its relationship to "model" (the word modifying it) and to "performs" (the verb for which it is the subject). The learned attention weights would then represent a compromise between these differing relational needs.
This inherent limitation of a single attention calculation motivates the development of a more sophisticated approach: Multi-Head Attention. By performing attention multiple times in parallel with different learned linear projections, the model can jointly attend to information from different representation subspaces, capturing a richer set of features and relationships without forcing them through a single bottleneck. We will examine this mechanism in the next section.
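As a preview, and continuing the sketch above, the parallel-heads idea can be outlined as follows. The head count, head size, and the omission of the final output projection are simplifications for illustration; the full mechanism is covered in the next section.

```python
def multi_head_attention(X, heads):
    """Run several attention heads in parallel, each with its own smaller
    projections, then concatenate the per-head context vectors.
    (The full mechanism also applies a final output projection,
    discussed in the next section.)
    """
    outputs = [single_head_attention(X, W_Q, W_K, W_V)
               for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=-1)  # (seq_len, num_heads * d_head)

# Four heads, each projecting into its own 4-dimensional subspace.
num_heads, d_head = 4, 4
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
multi_out = multi_head_attention(X, heads)   # shape (7, 16), built from four
                                             # independent attention patterns
```

Because each head has its own projections, one head is free to specialize in, say, syntactic links while another tracks semantic similarity, instead of forcing both through a single weight distribution.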