As we established in the preceding chapters, the self-attention mechanism provides a powerful way for a model to dynamically weigh the importance of different elements within an input sequence when computing a representation for each element. It achieves this by calculating alignment scores between pairs of elements using Query, Key, and Value vectors derived from the input.
However, the core self-attention calculation, scaled dot-product attention,

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

has a significant characteristic: it is permutation equivariant. If we reorder the input tokens (and their corresponding $Q$, $K$, $V$ vectors), the attention output computed for each token is unchanged; the full set of outputs is simply reordered according to the same permutation. The attention mechanism, in its basic form, treats the input as a set of vectors rather than an ordered sequence.
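To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single, unbatched sequence; the function name and shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) alignment scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # weighted sum of value vectors
```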
Consider a simple sentence: "robot detects anomaly". If the model processes this using only token embeddings and self-attention, the attention scores between "robot" and "detects", or "robot" and "anomaly", are computed solely from the vector representations of those words, irrespective of their positions (1st, 2nd, or 3rd). If we shuffled the input to "anomaly detects robot", the pairwise attention scores between the embeddings would be identical, and the contextualized representation computed for each word would be exactly the same as before, merely appearing at a different position in the output. The mechanism itself never processes the order.
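Building on the sketch above (reusing `scaled_dot_product_attention` and the `numpy` import), a quick hypothetical check with random vectors standing in for the three token embeddings illustrates this behaviour: permuting the inputs only permutes the outputs.

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # stand-in embeddings for "robot", "detects", "anomaly"

perm = [2, 1, 0]                   # reorder to "anomaly detects robot"
X_shuffled = X[perm]

# With Q = K = V = X and no positional signal, the shuffled output is just
# the original output reordered by the same permutation.
out = scaled_dot_product_attention(X, X, X)
out_shuffled = scaled_dot_product_attention(X_shuffled, X_shuffled, X_shuffled)

print(np.allclose(out[perm], out_shuffled))   # True: the mechanism never saw the order
```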
This contrasts sharply with recurrent architectures like RNNs or LSTMs. In an RNN, the computation at time step $t$ directly depends on the hidden state from time step $t-1$: $h_t = f(h_{t-1}, x_t)$. This sequential processing naturally incorporates the order of elements. Transformers, by processing all tokens in parallel via self-attention, lose this built-in sense of sequence order.
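For contrast, the following is a tiny, self-contained sketch of the recurrence $h_t = f(h_{t-1}, x_t)$ using a plain tanh cell with randomly chosen toy weights (all names here are illustrative, not a reference implementation); because each step consumes the previous hidden state, reordering the inputs changes the result.

```python
import numpy as np

def rnn_forward(X, W_h, W_x):
    # h_t = tanh(W_h h_{t-1} + W_x x_t), starting from h_0 = 0
    h = np.zeros(W_h.shape[0])
    for x_t in X:                  # strictly sequential: each step depends on the last
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # three toy token embeddings
W_h = rng.normal(size=(4, 4))
W_x = rng.normal(size=(4, 8))

# Reversing the input order produces a different final hidden state.
print(np.allclose(rnn_forward(X, W_h, W_x), rnn_forward(X[::-1], W_h, W_x)))   # False
```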
For almost any task involving sequences, especially in natural language processing, order is fundamental to meaning. "The dog chased the cat" has a completely different meaning from "The cat chased the dog". Time series predictions rely heavily on the temporal ordering of observations. Therefore, since the self-attention mechanism itself doesn't capture positional information, we must find an alternative way to provide this information to the model. The representations fed into the Transformer layers need to contain signals that indicate not just what a token is, but also where it is located within the sequence. This is the primary motivation for incorporating positional encodings, which we will explore next.