As we've established, self-attention provides a powerful mechanism for relating different positions within a sequence. However, applying a single scaled dot-product attention function can act as a bottleneck: it forces the model to average potentially diverse types of relationships or dependencies into a single weighted representation. Consider a sentence whose interpretation requires tracking both syntactic structure (like subject-verb agreement) and semantic meaning (like word similarity). A single attention function may struggle to capture both aspects at once, blending them into one indistinct representation.
This limitation motivates the use of Multi-Head Attention. The fundamental idea is elegantly simple: instead of calculating attention just once, we perform multiple attention calculations in parallel. Each parallel computation is termed an attention head.
Think of it like having multiple perspectives on the same input. If you were analyzing a complex system, you might consult experts from different fields. Each expert (head) brings a unique viewpoint, focusing on different aspects of the system (representation subspaces). Multi-head attention operates similarly.
Each attention head performs the same core scaled dot-product attention calculation we saw previously. However, the heads do not all operate on identical Query (Q), Key (K), and Value (V) matrices derived directly from the input sequence embeddings. Instead, before the attention calculation, each head applies its own independent linear transformations (projections) to Q, K, and V. Each head therefore learns to project the input into a subspace that may be better suited to capturing a particular type of relationship.
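To make this concrete, here is a minimal NumPy sketch of a single attention head. The function names, the projection size of 4, and the toy input are illustrative assumptions, not the exact formulation used later; the point is simply that each head owns its projection matrices and then runs ordinary scaled dot-product attention on the projected Q, K, and V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of value vectors

def attention_head(X, W_q, W_k, W_v):
    """One head: project the input into its own subspace, then attend."""
    Q = X @ W_q   # head-specific query projection
    K = X @ W_k   # head-specific key projection
    V = X @ W_v   # head-specific value projection
    return scaled_dot_product_attention(Q, K, V)

# Toy example: 5 tokens with model dimension 16, projected down to 4 per head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 4)) * 0.1 for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)   # (5, 4)
```

Because the projection matrices are learned separately for each head, two heads given the same input can end up attending to very different patterns.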
Why is this beneficial?
Essentially, multi-head attention gives the model multiple "looks" at the input sequence, each focusing through a different learned lens (the linear projections). This parallel processing allows for a more comprehensive understanding of the intricate relationships within the data. The next sections will explain exactly how these linear projections work for each head and how their parallel outputs are eventually integrated.
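As a brief preview, the following sketch runs several such heads in parallel and recombines them, reusing the `attention_head` function and the toy `X` from the sketch above. The concatenation-plus-output-projection scheme shown here is the standard approach, but the details (head counts, dimensions, and the final projection) are covered properly in the next sections; treat this only as an assumed outline.

```python
def multi_head_attention(X, head_params, W_o):
    """Run each head in parallel on its own projected subspace, then combine the outputs."""
    head_outputs = [attention_head(X, W_q, W_k, W_v)
                    for (W_q, W_k, W_v) in head_params]
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_o                              # mix head outputs back to the model dimension

# Toy example: 4 heads of size 4 recombine into the model dimension of 16.
head_params = [tuple(rng.normal(size=(16, 4)) * 0.1 for _ in range(3))
               for _ in range(4)]
W_o = rng.normal(size=(16, 16)) * 0.1
print(multi_head_attention(X, head_params, W_o).shape)   # (5, 16)
```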