Scaled Dot-Product Attention enables models to prioritize different tokens within a sequence. However, performing this calculation only once can force the attention mechanism to average over several distinct types of relationships. For example, consider a sentence such as 'The tired animal didn't cross the street because it was too wide.' When attending from the word 'it', a single attention mechanism might struggle to capture both the 'tired animal' relationship and the 'street width' relationship effectively at the same time.
Multi-Head Attention addresses this by running the Scaled Dot-Product Attention process multiple times in parallel, each with different learned transformations of the original queries, keys, and values. This allows each "head" to potentially focus on different aspects or representation subspaces of the information.
Here's the step-by-step process:
Linear Projections: Instead of using a single set of Query (Q), Key (K), and Value (V) matrices, Multi-Head Attention first creates $h$ different sets of these matrices, where $h$ is the number of attention heads (a hyperparameter). For each head $i$ (from $1$ to $h$), the original input Q, K, and V matrices (often derived from the same input sequence embeddings in the case of self-attention) are projected using learned weight matrices: $W_i^Q$, $W_i^K$, and $W_i^V$.
Typically, the dimensions of these projected matrices are smaller than the original embedding dimension ($d_{model}$). If the input embedding dimension is $d_{model}$, each head often works with dimensions $d_k = d_v = d_{model} / h$. This ensures that the total computational cost is similar to that of a single attention head operating at full dimensionality. These weight matrices ($W_i^Q$, $W_i^K$, $W_i^V$) are unique to each head and are learned during the training process.
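As a concrete illustration of this projection step, here is a minimal PyTorch sketch. The hyperparameter values and names such as `d_model`, `num_heads`, and the per-head `nn.Linear` layers are illustrative assumptions, not a fixed API:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (not prescribed by the text).
d_model, num_heads = 512, 8
d_k = d_model // num_heads  # per-head dimension: 64 here

# One learned projection per head for Q, K, and V.
q_projs = nn.ModuleList([nn.Linear(d_model, d_k) for _ in range(num_heads)])
k_projs = nn.ModuleList([nn.Linear(d_model, d_k) for _ in range(num_heads)])
v_projs = nn.ModuleList([nn.Linear(d_model, d_k) for _ in range(num_heads)])

x = torch.randn(2, 10, d_model)          # (batch, seq_len, d_model) input embeddings
q_heads = [proj(x) for proj in q_projs]  # each element: (2, 10, d_k)
k_heads = [proj(x) for proj in k_projs]
v_heads = [proj(x) for proj in v_projs]
```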
Parallel Attention Calculations: Each of these $h$ projected sets $(Q W_i^Q, K W_i^K, V W_i^V)$ is then fed into its own Scaled Dot-Product Attention mechanism simultaneously. This results in $h$ separate output matrices, let's call them $\mathrm{head}_1, \dots, \mathrm{head}_h$:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
Each $\mathrm{head}_i$ matrix captures attention information based on the specific projections learned by head $i$. Because the projections differ ($W_i^Q$, $W_i^K$, $W_i^V$ are different for each $i$), each head can potentially learn to focus on different types of relationships or features within the input sequence.
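Continuing the sketch above, each projected set can be passed through its own Scaled Dot-Product Attention call. The helper function below is a simplified version (masking and dropout are omitted, and the function name is an illustrative choice):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) for one head.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Run the attention calculation independently for every head.
head_outputs = [
    scaled_dot_product_attention(q_i, k_i, v_i)
    for q_i, k_i, v_i in zip(q_heads, k_heads, v_heads)
]  # each element: (2, 10, d_k)
```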
Concatenation: The outputs $\mathrm{head}_1, \dots, \mathrm{head}_h$ from all $h$ attention heads are concatenated together along the feature dimension. If each $\mathrm{head}_i$ has dimension $d_v$, the concatenated matrix will have dimension $h \cdot d_v$. Since we typically set $d_v = d_{model} / h$, the dimension of the concatenated matrix becomes $h \cdot (d_{model} / h) = d_{model}$, matching the original input embedding dimension.
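In the running sketch, this concatenation is a single tensor operation:

```python
# num_heads tensors of shape (2, 10, d_k) become one (2, 10, num_heads * d_k).
concatenated = torch.cat(head_outputs, dim=-1)
print(concatenated.shape)  # torch.Size([2, 10, 512]), i.e. (batch, seq_len, d_model)
```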
Final Linear Projection: This concatenated output is then passed through one final linear projection layer, parameterized by another learned weight matrix $W^O$. This projection mixes the information learned by the different heads and produces the final output of the Multi-Head Attention layer, which typically has dimension $d_{model}$:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
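In the sketch, the final projection is one more learned linear layer (here named `w_o` as an illustrative stand-in for $W^O$):

```python
# The final learned projection mixes information across the heads.
w_o = nn.Linear(num_heads * d_k, d_model)
output = w_o(concatenated)  # (2, 10, d_model)
```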
This entire Multi-Head Attention block can then be used as a component within the larger Transformer architecture, replacing the single Scaled Dot-Product Attention mechanism.
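Putting the pieces together, a compact, self-contained module might look like the sketch below. It fuses the $h$ per-head projections into single $d_{model} \to d_{model}$ linear layers, which is mathematically equivalent to stacking the smaller per-head weight matrices side by side; the class and variable names are illustrative choices, not a prescribed interface.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """A compact sketch of a Multi-Head Attention block (no masking or dropout)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One fused projection per input; the result is split into heads in forward().
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final projection (W^O)

    def forward(self, q, k, v):
        batch, seq_len, _ = q.shape

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, num_heads, seq, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention for all heads in parallel.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)  # (batch, num_heads, seq, d_k)

        # Concatenate heads, then apply the final linear projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(context)

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)   # self-attention: Q, K, and V all come from x
print(out.shape)     # torch.Size([2, 10, 512])
```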
The following diagram illustrates the flow of information through a Multi-Head Attention block with $h$ heads.
This diagram shows how the input Q, K, and V matrices are first projected independently for each of the $h$ attention heads. Scaled Dot-Product Attention is then applied to each projected set in parallel. The resulting attention outputs are concatenated and passed through a final linear layer to produce the Multi-Head Attention output.
By allowing different heads to learn different projection matrices ($W_i^Q$, $W_i^K$, $W_i^V$), Multi-Head Attention enables the model to jointly attend to information from different representation subspaces at different positions, leading to a richer and more effective representation compared to using a single attention mechanism.