Having projected the original queries ($Q$), keys ($K$), and values ($V$) into $h$ different subspaces using distinct learned linear transformations $W_i^Q$, $W_i^K$, and $W_i^V$ for each head $i$, the next step is to perform the attention calculation independently and simultaneously for each head. This parallel processing is a defining characteristic of multi-head attention and a significant contributor to its effectiveness and computational profile.
For each head $i$, where $i$ ranges from $1$ to $h$, we compute the scaled dot-product attention exactly as defined previously, but using the projected matrices $Q_i$, $K_i$, and $V_i$ specific to that head:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_{k_i}}}\right) V_i$$

Here, $Q_i$, $K_i$, and $V_i$ are the projected query, key, and value matrices for head $i$, and $d_{k_i}$ is the dimension of that head's queries and keys.
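As a concrete reference, a minimal PyTorch sketch of this per-head computation might look as follows; the function and argument names (q_i, k_i, v_i) are illustrative, not part of the original text:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q_i, k_i, v_i):
    """Scaled dot-product attention for a single head i.

    q_i, k_i have shape (N, d_k_i) and v_i has shape (N, d_v_i),
    matching the per-head projections described above.
    """
    d_k_i = q_i.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k_i)
    scores = q_i @ k_i.transpose(-2, -1) / d_k_i ** 0.5   # (N, N)
    weights = F.softmax(scores, dim=-1)                   # attention weights per query row
    return weights @ v_i                                  # (N, d_v_i)
```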
It's important to manage the dimensions correctly. If the input embedding dimension is $d_{model}$ and we use $h$ heads, the projections are typically designed so that the dimensions of each head's queries and keys ($d_{k_i}$) and values ($d_{v_i}$) are equal: $d_{k_i} = d_{v_i} = d_{model}/h$. This division keeps the total computational cost similar to that of a single-head attention mechanism with the full $d_{model}$ dimension, while distributing the representational capacity across multiple heads. It also guarantees that when the outputs of all heads are later concatenated, the resulting dimension matches the expected input dimension $d_{model}$ for subsequent layers, maintaining consistency throughout the model architecture.
Assuming an input sequence of length $N$ (number of tokens), the shapes of the matrices for head $i$ (omitting the batch dimension for simplicity) are typically:

- $Q_i$: $N \times d_{k_i}$
- $K_i$: $N \times d_{k_i}$
- $V_i$: $N \times d_{v_i}$

Consequently, the output of the attention calculation for head $i$, denoted $\text{head}_i$, has the shape $N \times d_{v_i}$. Since we usually set $d_{v_i} = d_{k_i} = d_{model}/h$, the output shape is $N \times (d_{model}/h)$.
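To make the bookkeeping concrete, here is a small shape check in PyTorch. The sizes ($d_{model} = 512$, $h = 8$, $N = 10$) and variable names are illustrative assumptions, not values prescribed by the text:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumed): d_model = 512, h = 8 heads, N = 10 tokens
d_model, h, N = 512, 8, 10
d_k = d_v = d_model // h                  # 64 per head, so h * d_k == d_model

q_i = torch.randn(N, d_k)                 # queries for head i: (N, d_k_i)
k_i = torch.randn(N, d_k)                 # keys for head i:    (N, d_k_i)
v_i = torch.randn(N, d_v)                 # values for head i:  (N, d_v_i)

scores = q_i @ k_i.T / d_k ** 0.5         # (N, N) attention scores
head_i = F.softmax(scores, dim=-1) @ v_i  # (N, d_v_i)
assert head_i.shape == (N, d_v)           # each head outputs N x (d_model / h)
```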
Computationally, this structure is highly amenable to parallelization. Modern deep learning frameworks and hardware such as GPUs excel at performing large matrix multiplications. Instead of iterating through each head sequentially, the computations for all $h$ heads can often be executed in parallel. This is typically achieved by reshaping the projected Q, K, and V tensors to include a distinct "head" dimension before the attention calculation. For instance, a tensor of queries for a batch might be reshaped from (batch_size, seq_len, d_model) to (batch_size, num_heads, seq_len, d_k_i). The batched matrix multiplications (matmul) required for the $\frac{Q_i K_i^T}{\sqrt{d_{k_i}}}$ term, followed by the softmax and the final matmul with $V_i$, can then operate efficiently across both the batch and head dimensions simultaneously.
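The sketch below shows this reshaping pattern in PyTorch. It assumes q, k, and v have already been projected (jointly for all heads) to shape (batch_size, seq_len, d_model); the function name is made up for illustration, and masking as well as the final output projection are omitted:

```python
import torch
import torch.nn.functional as F

def attention_all_heads(q, k, v, num_heads):
    """Compute attention for all heads in parallel via a head dimension.

    q, k, v: (batch_size, seq_len, d_model), already passed through the
    learned projections. A sketch only: masking and the output projection
    that follows concatenation are omitted.
    """
    batch_size, seq_len, d_model = q.shape
    d_k = d_model // num_heads

    def split_heads(x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        return x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # Batched matmul broadcasts over both the batch and head dimensions at once
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, heads, seq, seq)
    weights = F.softmax(scores, dim=-1)
    heads = weights @ v                                # (batch, heads, seq, d_k)

    # Undo the head split: (batch, seq_len, d_model), i.e. the concatenated heads
    return heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

# Example: a batch of 2 sequences of 10 tokens, d_model = 512, 8 heads
out = attention_all_heads(torch.randn(2, 10, 512), torch.randn(2, 10, 512),
                          torch.randn(2, 10, 512), num_heads=8)
print(out.shape)  # torch.Size([2, 10, 512])
```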
Independent scaled dot-product attention computations are performed for each head using its specific projected matrices ($Q_i$, $K_i$, $V_i$). The outputs ($\text{head}_1, \dots, \text{head}_h$), each with dimension $N \times d_{v_i}$, are generated in parallel before being passed to the next stage. Note that $d_{k_i}$ and $d_{v_i}$ represent $d_{model}/h$.
The primary advantage of this parallel structure extends beyond computational efficiency. It permits each attention head to potentially specialize and learn different types of relationships or attend to information from distinct representation subspaces simultaneously. For instance, one head might learn to focus on local syntactic dependencies (like adjective-noun agreement), while another captures longer-range semantic connections (like coreference resolution across sentences), and yet another might focus on positional relationships. A single attention mechanism would be forced to average these potentially disparate signals, which could dilute the information. Multi-head attention provides multiple independent "channels" for information flow, allowing the model to aggregate diverse relational information and ultimately build richer, more context-aware representations.
The outputs of these parallel computations, $\text{head}_1, \text{head}_2, \dots, \text{head}_h$, capture different aspects of the input sequence's internal relationships. They are now ready to be combined in the next step: concatenation followed by a final linear projection.