We've explored how Scaled Dot-Product Attention allows a model to calculate relevance scores between different positions in a sequence, using Query (Q), Key (K), and Value (V) vectors. This mechanism is effective, enabling the model to focus on pertinent parts of the input when generating an output representation for a specific position.
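As a quick refresher, here is a minimal sketch of that calculation in PyTorch. The tensor shapes, random inputs, and function name are illustrative assumptions, not part of any specific model:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.size(-1)
    # Relevance scores between every query position and every key position
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # one attention distribution per query
    return weights @ V                              # weighted sum of value vectors

# Toy example: a sequence of 5 positions with dimension 8
Q = K = V = torch.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([5, 8])
```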
However, is a single attention calculation enough? When we process language, we often consider multiple aspects simultaneously. For example, when reading "The quick brown fox jumps over the lazy dog," understanding the word "jumps" might involve looking at the subject ("fox") for grammatical agreement, but also potentially relating it to the object ("dog") or the manner ("quick"). A single set of attention weights might struggle to capture these diverse relationships effectively. It might average out different types of dependencies or focus predominantly on one aspect, potentially missing other significant connections.
This observation leads to the concept of Multi-Head Attention. The core idea is straightforward but powerful: instead of performing just one attention calculation based on the original Q, K, and V vectors, we perform multiple attention calculations in parallel. Each parallel calculation is referred to as an "attention head."
Think of it like having multiple experts examining the same sequence. Each expert (head) might specialize in looking for different patterns or relationships. One head might focus on short-range syntactic links, another on longer-range semantic similarities, and yet another on positional relationships.
Critically, each attention head doesn't just recalculate attention on the same original Q, K, and V. Instead, before calculating attention, the model learns separate linear projections for each head. This means the original Q, K, and V vectors are projected into different, lower-dimensional subspaces for each head.
Let's say we have $h$ attention heads. For each head $i$ (where $i$ ranges from 1 to $h$), we learn projection matrices $W_i^Q$, $W_i^K$, and $W_i^V$. The input Q, K, and V are then projected as follows:
$$Q_i = Q W_i^Q \qquad K_i = K W_i^K \qquad V_i = V W_i^V$$
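The sketch below shows these per-head projections in PyTorch. The dimensions ($d_{model} = 512$, $h = 8$) and the use of one separate `nn.Linear` layer per head are illustrative assumptions chosen to mirror the equations above; production implementations usually fuse the per-head matrices into a single larger projection:

```python
import torch
import torch.nn as nn

d_model = 512                 # dimension of the incoming Q, K, V vectors (assumed)
num_heads = 8                 # h
d_k = d_model // num_heads    # dimension of each head's subspace

# One learned projection per head and per role (Q, K, V)
W_q = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)])
W_k = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)])
W_v = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)])

Q = K = V = torch.randn(5, d_model)   # toy sequence of 5 positions

# Project the shared Q, K, V into each head's lower-dimensional subspace
Q_heads = [W_q[i](Q) for i in range(num_heads)]   # each (5, d_k)
K_heads = [W_k[i](K) for i in range(num_heads)]
V_heads = [W_v[i](V) for i in range(num_heads)]
```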
Each head $i$ then performs the Scaled Dot-Product Attention calculation using its own projected $Q_i$, $K_i$, and $V_i$:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$

Here, $d_k$ is the dimension of the key vectors within a single head (i.e., the dimension of $K_i$).
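Putting the projection and attention steps together, the sketch below computes $\text{head}_i$ for every head with a simple loop. Again, the dimensions and the separate per-head `nn.Linear` layers are illustrative choices rather than a fixed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_heads = 512, 8
d_k = d_model // num_heads

def attention(Q_i, K_i, V_i):
    # Scaled dot-product attention inside a single head
    scores = Q_i @ K_i.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V_i

# Hypothetical per-head projections, as in the previous sketch
proj_q = [nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)]
proj_k = [nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)]
proj_v = [nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)]

Q = K = V = torch.randn(5, d_model)   # toy sequence of 5 positions

# head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), computed for every head
heads = [attention(proj_q[i](Q), proj_k[i](K), proj_v[i](V))
         for i in range(num_heads)]
print(len(heads), heads[0].shape)     # 8 heads, each of shape (5, 64)
```

Each head's output lives in a smaller subspace of dimension $d_k$; combining these outputs is the subject of the next section.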
The intuition is that each projection $(W_i^Q, W_i^K, W_i^V)$ allows a head to learn to focus on a different aspect, or "representation subspace," of the information contained in the original embeddings. By running these calculations in parallel, the model can simultaneously gather insights about different types of relationships within the sequence.
This parallel processing allows the model to jointly attend to information from different positions based on different representational criteria. The combined output, as we'll see in the next section, provides a richer and more multifaceted representation compared to what a single attention mechanism could produce. We will now look at how these parallel computations are managed and combined to produce the final output of the Multi-Head Attention layer.