Having established how to represent input tokens as dense vectors (token embeddings) and how to generate distinct signals representing their positions (positional encodings), the next step is integrating these two sources of information. The self-attention mechanism, as discussed previously, processes input elements without inherent knowledge of their order. Therefore, we must supply this positional context directly within the input representations fed into the Transformer stack.
The standard and remarkably effective method proposed in the original Transformer paper ("Attention Is All You Need") is straightforward: element-wise addition. If $E \in \mathbb{R}^{L \times d_{\text{model}}}$ represents the matrix of token embeddings for a sequence of length $L$, and $P \in \mathbb{R}^{L \times d_{\text{model}}}$ represents the corresponding positional encodings, the combined input representation $X \in \mathbb{R}^{L \times d_{\text{model}}}$ is computed as:

$$X = E + P$$

This means that for each token $i$ in the sequence (where $i$ ranges from $0$ to $L-1$), its final input vector $x_i$ is the sum of its token embedding $e_i$ and its positional encoding $p_i$:

$$x_i = e_i + p_i$$

Here, both the token embedding $e_i$ and the positional encoding $p_i$ must have the same dimensionality, $d_{\text{model}}$. This consistency in dimension is a fundamental requirement for the addition operation.
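As a quick numerical illustration, here is a minimal NumPy sketch of the per-token addition $x_i = e_i + p_i$. The toy $d_{\text{model}} = 4$ and the vector values are arbitrary placeholders chosen for readability, not outputs of any trained model:

```python
import numpy as np

d_model = 4  # toy dimensionality, chosen only for readability

# Arbitrary placeholder values; in a real model e_i comes from the embedding
# table and p_i from the positional encoding scheme.
e_i = np.array([0.2, -0.5, 0.1, 0.7])  # token embedding for token i
p_i = np.array([0.0, 1.0, 0.8, 0.5])   # positional encoding for position i

assert e_i.shape == p_i.shape == (d_model,)  # matching shapes are required

x_i = e_i + p_i  # element-wise addition
print(x_i)       # approximately [0.2  0.5  0.9  1.2]
```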
Why simple addition? While other composition functions could be conceived (concatenation is the most common alternative), addition offers several advantages: it keeps the dimensionality at $d_{\text{model}}$, so the rest of the architecture and its parameter count are unchanged; it is computationally cheap; and the learned projections in the subsequent attention and feed-forward layers can still extract and use both the semantic and the positional components of the summed vectors.
In practice, combining embeddings and positional encodings typically involves these steps (a code sketch follows the figure below):

1. Token embedding lookup: map the input token IDs to their embedding vectors, producing a tensor of shape [batch_size, sequence_length, d_model].
2. Positional encoding retrieval: obtain encodings of shape [1, sequence_length, d_model] or [sequence_length, d_model], which can be broadcast or sliced to match the current sequence length. With learned positional embeddings, this is a lookup into a table of shape [1, sequence_length, d_model] or similar, containing trainable vectors.
3. Element-wise addition: add the positional tensor to the embedding tensor to produce the combined input representation.

Figure: Flow diagram illustrating the combination of token embeddings and positional encodings via element-wise addition to create the input representation for the Transformer layers.
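Putting these steps together, the following is a minimal PyTorch-style sketch that assumes sinusoidal positional encodings as in the original paper. The class name TransformerInput, the max_len default, and the even-d_model assumption are illustrative choices, not part of any specific library API:

```python
import math
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Token embeddings plus fixed sinusoidal positional encodings (sketch)."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 5000):
        super().__init__()
        assert d_model % 2 == 0, "this sketch assumes an even d_model"
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Precompute sinusoidal encodings with shape [1, max_len, d_model]
        # so they broadcast over the batch dimension at addition time.
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-math.log(10000.0) / d_model)
        )                                                                   # [d_model / 2]
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer, not a parameter: saved and moved with the module, never trained.
        self.register_buffer("positional_encoding", pe.unsqueeze(0))        # [1, max_len, d_model]

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch_size, sequence_length] of integer token IDs
        seq_len = token_ids.size(1)
        embeddings = self.token_embedding(token_ids)              # [B, L, d_model]
        positions = self.positional_encoding[:, :seq_len, :]      # [1, L, d_model], sliced to L
        return embeddings + positions                             # broadcast element-wise addition
```

Registering the encodings as a buffer keeps them out of the optimizer's parameter list while still ensuring they are saved with the model and moved to the correct device alongside the embeddings.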
If learned positional embeddings are used instead of sinusoidal ones, the combination mechanism remains the same: element-wise addition. The primary difference is that the vectors representing positions 0,1,2,… are now parameters optimized during the training process, rather than being determined by fixed functions. The model learns the positional representations that are most effective for the task, given the data. The addition step still merges these learned positional signals with the semantic token embeddings.
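For comparison, a learned-position variant can be sketched by swapping the precomputed buffer for a trainable lookup table. The class name LearnedPositionalInput and the max_len default are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class LearnedPositionalInput(nn.Module):
    """Token embeddings plus learned positional embeddings (sketch)."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Trainable table: one d_model-dimensional vector per position index,
        # optimized during training along with the rest of the model.
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch_size, sequence_length]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)  # indices 0..L-1
        e = self.token_embedding(token_ids)                         # [B, L, d_model]
        p = self.position_embedding(positions)                      # [L, d_model], broadcasts over batch
        return e + p                                                # same element-wise addition as before
```

Only the source of the positional vectors changes; the addition in the final line is identical to the sinusoidal case.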
By adding the positional encodings directly to the token embeddings, we create an input representation X that equips the subsequent Transformer layers with both the 'what' (semantics from E) and the 'where' (sequence order from P) information necessary for sophisticated sequence understanding. This combined representation forms the input to the first layer of the encoder or decoder stack.