Integrating token embeddings and positional encodings is a fundamental step in preparing input for Transformer models. Token embeddings represent input tokens as dense vectors, while positional encodings provide distinct signals indicating each token's position in the sequence. The self-attention mechanism, a core component of Transformers, processes input elements without inherent knowledge of their sequential order. Consequently, this positional context must be supplied directly within the input representations fed into the Transformer stack.
The standard and remarkably effective method proposed in the original Transformer paper ("Attention Is All You Need") is straightforward: element-wise addition. If $E \in \mathbb{R}^{n \times d_{model}}$ represents the matrix of token embeddings for a sequence of length $n$, and $P \in \mathbb{R}^{n \times d_{model}}$ represents the corresponding positional encodings, the combined input representation $X$ is computed as:

$$X = E + P$$
This means for each token in the sequence (where the position index $i$ ranges from $1$ to $n$), its final input vector $x_i$ is the sum of its token embedding $e_i$ and its positional encoding $p_i$:

$$x_i = e_i + p_i$$
Here, both the token embedding $e_i$ and the positional encoding $p_i$ must have the same dimensionality, $d_{model}$. This consistency in dimension is a fundamental requirement for the addition operation.
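As a small numeric illustration, here is a toy sketch in PyTorch with an assumed $d_{model}$ of 4 and arbitrary values (real models use dimensions such as 512 or 768):

```python
import torch

d_model = 4  # toy dimensionality for illustration only

# Token embedding and positional encoding for a single position i
e_i = torch.tensor([0.20, -0.50, 0.10, 0.80])  # semantic vector for the token
p_i = torch.tensor([0.00, 1.00, 0.84, 0.54])   # positional signal for position i

# Both vectors share the same dimensionality, so element-wise addition is valid
x_i = e_i + p_i
print(x_i)  # tensor([0.2000, 0.5000, 0.9400, 1.3400])
```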
Why simple addition? While other composition functions could be conceived (concatenation, for example), addition offers several advantages: it preserves the model dimension $d_{model}$ rather than enlarging the input vectors, it introduces no additional parameters or computation, and it lets the model blend semantic and positional information within a single shared vector space.
In practice, combining embeddings and positional encodings typically involves these steps:
- Look up the token embeddings for the input sequence, producing a tensor of shape [batch_size, sequence_length, d_model].
- Generate or retrieve sinusoidal positional encodings of shape [1, sequence_length, d_model] or [sequence_length, d_model], which can be broadcast or sliced to match the batch.
- Alternatively, use a learned positional embedding table of shape [1, sequence_length, d_model] or similar, containing trainable vectors.
- Add the two tensors element-wise to form the combined input representation (a code sketch of this flow appears below).

Figure: Flow diagram illustrating the combination of token embeddings and positional encodings via element-wise addition to create the input representation for the Transformer layers.
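A minimal PyTorch sketch of these steps is shown below. The module name TransformerInput, the max_len default, and the hyperparameter values are illustrative assumptions; the sinusoidal encoding follows the standard formulation, precomputed once and stored as a non-trainable buffer.

```python
import math
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Combines token embeddings with sinusoidal positional encodings by addition."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 5000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Precompute sinusoidal positional encodings, shaped [1, max_len, d_model]
        position = torch.arange(max_len).unsqueeze(1)                                     # [max_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))  # [d_model/2]
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # buffer, not a trainable parameter

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch_size, sequence_length]
        embeddings = self.token_embedding(token_ids)       # [batch, seq_len, d_model]
        seq_len = token_ids.size(1)
        # Slice the precomputed encodings to the current length; broadcasting covers the batch
        return embeddings + self.pe[:, :seq_len, :]        # [batch, seq_len, d_model]

# Usage: a batch of 2 sequences, each 10 tokens long
inputs = TransformerInput(vocab_size=10000, d_model=512)
token_ids = torch.randint(0, 10000, (2, 10))
x = inputs(token_ids)
print(x.shape)  # torch.Size([2, 10, 512])
```

Because the sinusoidal table is registered as a buffer, only the token embedding matrix is learned here; the positional signal stays fixed throughout training.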
If learned positional embeddings are used instead of sinusoidal ones, the combination mechanism remains the same: element-wise addition. The primary difference is that the vectors representing positions are now parameters optimized during the training process, rather than being determined by fixed functions. The model learns the positional representations that are most effective for the task, given the data. The addition step still merges these learned positional signals with the semantic token embeddings.
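For the learned variant, a sketch might look like the following. The class name LearnedPositionalInput and the max_len limit are assumptions for illustration; an nn.Embedding table holds one trainable vector per position.

```python
import torch
import torch.nn as nn

class LearnedPositionalInput(nn.Module):
    """Combines token embeddings with learned positional embeddings by addition."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Positional vectors are trainable parameters, one per position up to max_len
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch_size, sequence_length]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)  # [seq_len]
        # Broadcasting adds the same positional vector to every sequence in the batch
        return self.token_embedding(token_ids) + self.position_embedding(positions)
```

The forward pass is the same in spirit as the sinusoidal version: look up both tables and add, with broadcasting handling the batch dimension.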
By adding the positional encodings directly to the token embeddings, we create an input representation that equips the subsequent Transformer layers with both the 'what' (semantics from $E$) and the 'where' (sequence order from $P$) information necessary for sophisticated sequence understanding. This combined representation forms the input to the first layer of the encoder or decoder stack.