Building on the scaled dot-product attention mechanism, this chapter focuses on self-attention, the case where the queries, keys, and values are all derived from the same sequence, so each position can draw on every other position when computing its own representation. Applying only a single attention function, however, can limit the model's ability to capture several kinds of relationships at once: a single set of attention weights tends to average competing patterns together.
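As a concrete reference point, here is a minimal NumPy sketch of scaled dot-product self-attention, where the queries, keys, and values all come from the same input matrix. The weight matrices `W_q`, `W_k`, `W_v` and the toy dimensions are illustrative assumptions, not code from this chapter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention where Q, K, and V are projections of the same input X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each: (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # compatibility between every pair of positions
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted sum of value vectors per position

# Toy example: 4 positions, model dimension 8 (illustrative sizes only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (4, 8)
```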
To overcome this limitation, we examine Multi-Head Attention. This approach allows the model to jointly attend to information from different representation subspaces at different positions by running several attention heads in parallel, each with its own learned projections of the queries, keys, and values, and then combining their outputs through a final linear projection.
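The sketch below previews the pipeline developed over the coming sections: per-head linear projections of Q, K, and V, parallel attention computations, and concatenation followed by a final linear projection. It is a minimal NumPy illustration under assumed shapes and names (`W_q`, `W_k`, `W_v`, `W_o`, `num_heads`); the hands-on practical at the end of the chapter builds the layer properly.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into num_heads subspaces, attend in each, then concatenate and project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project once, then reshape so each head sees its own d_head-dimensional slice.
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # All heads are computed in parallel via batched matrix multiplication.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ V                   # (num_heads, seq_len, d_head)

    # Concatenate the head outputs and apply the final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # (seq_len, d_model)

# Toy example with illustrative sizes: 5 positions, d_model = 16, 4 heads.
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 16)
```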
This chapter covers:

3.1 Self-Attention: Queries, Keys, Values from the Same Source
3.2 Limitations of Single Attention Head
3.3 Introducing Multiple Attention Heads
3.4 Linear Projections for Q, K, V per Head
3.5 Parallel Attention Computations
3.6 Concatenation and Final Linear Projection
3.7 Analysis of What Different Heads Learn
3.8 Hands-on Practical: Building a Multi-Head Attention Layer

By understanding these components, you gain insight into a fundamental building block of the Transformer architecture.