Previous chapters detailed the self-attention mechanism, which allows models to weigh the significance of different input elements relative to each other. However, the core attention calculation, operating on sets of queries, keys, and values, does not inherently account for the sequential order of the input. This chapter addresses this gap.
We will first look at the input embedding layer, which is responsible for converting input tokens into continuous vector representations. The primary focus will then shift to techniques for encoding positional information. We will study the widely used sinusoidal positional encoding scheme, including its mathematical formulation:
$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$
We will analyze the properties that make these functions suitable and how they are typically combined with the token embeddings. We will also discuss alternative approaches like learned positional embeddings and compare their characteristics.
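As a preview of the practice section, the formulas above can be computed directly. The following is a minimal NumPy sketch (the function and variable names are illustrative, and an even `d_model` is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]      # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # even dimensions 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices use sine
    pe[:, 1::2] = np.cos(angles)   # odd indices use cosine
    return pe

# Example: add positional encodings to a batch of token embeddings.
seq_len, d_model = 10, 16
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in for an embedding lookup
inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(inputs.shape)  # (10, 16)
```

The element-wise sum at the end previews how positional encodings are typically combined with token embeddings, which is covered in detail later in the chapter.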
4.1 The Need for Positional Information
4.2 Input Embedding Layer Transformation
4.3 Sinusoidal Positional Encoding: Formulation
4.4 Properties of Sinusoidal Encodings
4.5 Combining Embeddings and Positional Encodings
4.6 Alternative: Learned Positional Embeddings
4.7 Comparison: Sinusoidal vs. Learned Embeddings
4.8 Practice: Generating and Visualizing Positional Encodings