As established, relying on a single self-attention mechanism can constrain the model's ability to capture the multifaceted relationships present within a sequence. A single attention calculation might average diverse relational signals, potentially losing specific, valuable information. Multi-Head Attention addresses this by performing attention computations multiple times in parallel, allowing the model to jointly attend to information from different perspectives or "representation subspaces."
The first step in enabling this parallel processing is to generate distinct Query (Q), Key (K), and Value (V) vectors for each attention "head". Instead of feeding the raw input embeddings directly into each head's attention calculation, we apply separate learned linear transformations to the input sequence for each head.
Let the input sequence representation be a matrix $X \in \mathbb{R}^{n \times d_{model}}$, where $n$ is the sequence length and $d_{model}$ is the dimension of the input embeddings (and the model's hidden state size). For a multi-head attention mechanism with $h$ heads, we introduce $h$ sets of learnable weight matrices:

$$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}$$
Here, $d_k$ represents the dimension of the Keys and Queries for each head, and $d_v$ represents the dimension of the Values for each head. A common and effective choice, as used in the original Transformer paper ("Attention Is All You Need"), is to set $d_k = d_v = d_{model}/h$. This ensures that the total computational cost across all heads remains roughly similar to a single attention mechanism operating on the full $d_{model}$ dimension, while distributing the representational capacity.
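For concreteness, the base model in "Attention Is All You Need" uses $d_{model} = 512$ and $h = 8$ heads, so each head operates in subspaces of size

$$d_k = d_v = \frac{512}{8} = 64.$$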
For each head $i$, the corresponding Query, Key, and Value matrices ($Q_i$, $K_i$, $V_i$) are computed by projecting the input $X$ using these weight matrices:
$$Q_i = X W_i^Q, \qquad K_i = X W_i^K, \qquad V_i = X W_i^V$$

The resulting matrices $Q_i, K_i \in \mathbb{R}^{n \times d_k}$ and $V_i \in \mathbb{R}^{n \times d_v}$ now represent the input sequence transformed into the specific subspace relevant for head $i$.
Flow showing input embeddings $X$ being projected independently for each attention head $i$ using distinct weight matrices ($W_i^Q, W_i^K, W_i^V$) to produce head-specific Query ($Q_i$), Key ($K_i$), and Value ($V_i$) matrices. The projection typically reduces dimensionality from $d_{model}$ to $d_k$ or $d_v$.
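To make the projection concrete, here is a minimal PyTorch sketch that computes $Q_i$, $K_i$, and $V_i$ for a single head. The sizes ($n = 10$, $d_{model} = 512$, $h = 8$) are illustrative, and the randomly initialized tensors stand in for the learned parameters $W_i^Q$, $W_i^K$, $W_i^V$:

```python
import torch

# Illustrative dimensions: n = sequence length, d_model = embedding size,
# h = number of heads, with d_k = d_v = d_model / h as in the text.
n, d_model, h = 10, 512, 8
d_k = d_v = d_model // h

X = torch.randn(n, d_model)        # input sequence representation

# Projection matrices for head i (random stand-ins for learned weights)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q_i = X @ W_q                      # shape (n, d_k)
K_i = X @ W_k                      # shape (n, d_k)
V_i = X @ W_v                      # shape (n, d_v)

print(Q_i.shape, K_i.shape, V_i.shape)  # torch.Size([10, 64]) for each
```

In a full Multi-Head Attention layer, this projection is carried out in parallel for every head $i = 1, \dots, h$.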
These independent linear transformations are significant for several reasons. Because each head has its own learned projection, different heads can specialize in capturing different types of relationships within the sequence, rather than forcing a single set of weights to represent them all. Projecting into lower-dimensional subspaces ($d_k, d_v < d_{model}$) also keeps the combined cost of the $h$ heads comparable to a single full-dimensional attention computation. Finally, since the projections are learned end to end, the model itself discovers which subspaces are most useful for the task.
In practice, these linear projections are implemented with standard linear (dense) layers, as provided by frameworks like PyTorch or TensorFlow, typically without bias terms. The weight matrices $W_i^Q, W_i^K, W_i^V$ are initialized randomly and learned jointly with the rest of the network's parameters during training via backpropagation.
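As a rough sketch of how this looks in code, the PyTorch module below (a hypothetical `HeadProjections` class, not a library API) fuses the $h$ per-head matrices into three bias-free `nn.Linear` layers of shape $d_{model} \times d_{model}$ and then reshapes the output into per-head blocks, which is mathematically equivalent to applying $h$ separate $d_{model} \times d_k$ projections:

```python
import torch
from torch import nn

class HeadProjections(nn.Module):
    """Computes Q_i, K_i, V_i for all h heads at once.

    The per-head matrices W_i^Q, W_i^K, W_i^V are stored as slices of three
    fused (d_model x d_model) weight matrices, equivalent to h separate
    (d_model x d_k) projections but more efficient in practice.
    """
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by h"
        self.h = num_heads
        self.d_k = d_model // num_heads
        # Linear layers without bias terms, as described in the text
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, n, d_model)
        batch, n, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, n, d_model) -> (batch, h, n, d_k)
            return t.view(batch, n, self.h, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))
        return q, k, v

# Usage: project a batch of 2 sequences of length 10 with d_model=512, h=8
proj = HeadProjections(d_model=512, num_heads=8)
q, k, v = proj(torch.randn(2, 10, 512))
print(q.shape)  # torch.Size([2, 8, 10, 64])
```

Fusing the per-head weights into a single matrix multiplication is a common efficiency choice; slicing the result along the last dimension recovers exactly the head-specific $Q_i$, $K_i$, $V_i$ described above.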
Once these head-specific $Q_i$, $K_i$, $V_i$ matrices are computed, they become the inputs for the parallel scaled dot-product attention computations, which we will examine next. This projection step is fundamental to how Multi-Head Attention achieves its ability to analyze input sequences from multiple perspectives simultaneously.