Building upon the concept of attention introduced previously, this chapter focuses on the specific attention mechanisms used within Transformer models. We will examine self-attention, a technique that allows the model to weigh the significance of different words within the same input sequence when processing a specific word.
You will learn how input embeddings are projected into Query (Q), Key (K), and Value (V) vectors, which form the basis for calculating attention scores. We will detail the Scaled Dot-Product Attention formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors.
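To make the formula concrete, here is a minimal sketch of Scaled Dot-Product Attention, assuming PyTorch as the deep learning library (the hands-on exercise may use a different one); the function name, optional mask argument, and tensor shapes are illustrative, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched Q, K, V tensors."""
    d_k = Q.size(-1)                               # dimension of the key vectors
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # scaled similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)            # attention weights per query position
    return weights @ V, weights

# Illustrative shapes: one sequence of 4 tokens with d_k = 8
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```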
Furthermore, we will examine Multi-Head Attention. This approach involves running the Scaled Dot-Product Attention mechanism multiple times in parallel with different, learned linear projections of Q, K, and V. This allows the model to jointly attend to information from different representation subspaces at different positions. We will cover its mechanics and the reasoning behind its effectiveness. Finally, the chapter includes a practical exercise where you will implement the Scaled Dot-Product Attention mechanism using a deep learning library.
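As a preview of the mechanics covered in sections 2.5 through 2.7, the sketch below shows one common way to arrange the learned projections and parallel heads, again assuming PyTorch; the class name, dimensions (d_model = 64, 8 heads), and layer names are illustrative assumptions rather than the chapter's reference implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Run scaled dot-product attention in parallel over several heads."""
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learned linear projections for Q, K, V and the final output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project x, then split the model dimension into (num_heads, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Each head applies scaled dot-product attention in its own subspace
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ V
        # Concatenate the heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

x = torch.randn(2, 10, 64)  # batch of 2 sequences, 10 tokens each, d_model = 64
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 64])
```

Splitting one d_model-wide projection into num_heads smaller pieces keeps the total computation roughly the same as a single wide attention while letting each head learn to attend to a different representation subspace.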
2.1 The Idea Behind Self-Attention
2.2 Query, Key, and Value Vectors in Self-Attention
2.3 Scaled Dot-Product Attention Mechanism
2.4 Visualizing Self-Attention Scores
2.5 Introduction to Multi-Head Attention
2.6 How Multi-Head Attention Works
2.7 Benefits of Multiple Attention Heads
2.8 Hands-on Practical: Implementing Scaled Dot-Product Attention