Integrating token embeddings and positional encodings is a fundamental step in preparing input for Transformer models. Token embeddings represent input tokens as dense vectors, while positional encodings generate distinct signals indicating their positions. The self-attention mechanism, a core component of Transformers, processes input elements without inherent knowledge of their sequential order. Consequently, this positional context must be supplied directly within the input representations fed into the Transformer stack.
The standard and remarkably effective method proposed in the original Transformer paper ("Attention Is All You Need") is straightforward: element-wise addition. If $E \in \mathbb{R}^{L \times d_{model}}$ represents the matrix of token embeddings for a sequence of length $L$, and $P \in \mathbb{R}^{L \times d_{model}}$ represents the corresponding positional encodings, the combined input representation $X \in \mathbb{R}^{L \times d_{model}}$ is computed as:

$$X = E + P$$

This means that for each token $i$ in the sequence (where $i$ ranges from $0$ to $L-1$), its final input vector $x_i$ is the sum of its token embedding $e_i$ and its positional encoding $p_i$:

$$x_i = e_i + p_i$$

Here, both the token embedding $e_i$ and the positional encoding $p_i$ must have the same dimensionality, $d_{model}$. This consistency in dimension is a fundamental requirement for the addition operation.
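To make the shape requirement concrete, here is a minimal PyTorch sketch (PyTorch and the toy sizes are illustrative assumptions, not part of the original text) that adds two $L \times d_{model}$ tensors standing in for $E$ and $P$:

```python
import torch

L, d_model = 4, 8                          # toy sizes, chosen only for illustration

E = torch.randn(L, d_model)                # stands in for token embeddings, one row per token
P = torch.randn(L, d_model)                # stands in for positional encodings, same shape as E

X = E + P                                  # element-wise addition: x_i = e_i + p_i
assert X.shape == (L, d_model)             # dimensionality is unchanged by the addition
assert torch.allclose(X[2], E[2] + P[2])   # each row of X is the sum of the matching rows
```

In practice $P$ would come from sinusoidal functions or a learned table rather than random values, as discussed below; the point here is simply that both tensors share the shape $L \times d_{model}$ and are summed position by position.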
Why simple addition? While other composition functions could be conceived, addition offers several advantages:

- It preserves the dimensionality $d_{model}$, so every downstream layer sees the same input size whether or not positional information is included. Concatenation, by contrast, would widen the input or require an extra projection.
- For fixed sinusoidal encodings it adds no learnable parameters and almost no computation beyond the element-wise sum.
- In a high-dimensional space, the model can learn to use, combine, or partially separate the semantic and positional components of the summed vector.
In practice, combining embeddings and positional encodings typically involves these steps:
- Look up the token embeddings for the input IDs, producing a tensor of shape `[batch_size, sequence_length, d_model]`.
- For fixed sinusoidal encodings, precompute a table of shape `[1, sequence_length, d_model]` or `[sequence_length, d_model]` which can be broadcast or sliced to the current sequence length.
- For learned positional embeddings, maintain a parameter tensor of shape `[1, sequence_length, d_model]` or similar, containing trainable vectors.
- Add the positional tensor to the token embeddings element-wise to obtain the combined input.

*Flow diagram illustrating the combination of token embeddings and positional encodings via element-wise addition to create the input representation for the Transformer layers.*
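The following sketch, a non-authoritative example assuming PyTorch and arbitrary hyperparameter values, walks through these steps for the fixed sinusoidal case: an embedding lookup, a precomputed positional table sliced to the current sequence length, and a broadcast addition.

```python
import math
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
vocab_size, max_len, d_model = 30_000, 512, 64
batch_size, seq_len = 8, 20

# 1. Token embedding lookup -> [batch_size, sequence_length, d_model]
token_embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
E = token_embedding(token_ids)                        # [8, 20, 64]

# 2. Precompute sinusoidal encodings once, shaped [1, max_len, d_model] for broadcasting
position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # [max_len, 1]
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * (-math.log(10000.0) / d_model))               # [d_model / 2]
pos_table = torch.zeros(max_len, d_model)
pos_table[:, 0::2] = torch.sin(position * div_term)   # even dimensions
pos_table[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
pos_table = pos_table.unsqueeze(0)                    # [1, max_len, d_model]

# 3. Slice to the actual sequence length and add; the size-1 batch axis broadcasts
P = pos_table[:, :seq_len, :]                         # [1, 20, 64]
X = E + P                                             # [8, 20, 64]
print(X.shape)                                        # torch.Size([8, 20, 64])
```

Precomputing the table up to a maximum length and slicing it per batch avoids recomputing the sinusoids on every forward pass.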
If learned positional embeddings are used instead of sinusoidal ones, the combination mechanism remains the same: element-wise addition. The primary difference is that the vectors representing positions 0,1,2,… are now parameters optimized during the training process, rather than being determined by fixed functions. The model learns the positional representations that are most effective for the task, given the data. The addition step still merges these learned positional signals with the semantic token embeddings.
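A comparable sketch for the learned variant, again assuming PyTorch and illustrative sizes, replaces the fixed table with a trainable `nn.Embedding` indexed by position IDs; the final addition is identical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
vocab_size, max_len, d_model = 30_000, 512, 64

token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_len, d_model)          # trainable positional vectors

token_ids = torch.randint(0, vocab_size, (8, 20))            # [batch_size, sequence_length]
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # [1, seq_len]: 0, 1, 2, ...

E = token_embedding(token_ids)            # [8, 20, 64]
P = position_embedding(position_ids)      # [1, 20, 64], broadcasts over the batch
X = E + P                                 # same element-wise addition as before
print(X.shape)                            # torch.Size([8, 20, 64])
```

Because the positional table is an ordinary parameter matrix, its rows are updated by backpropagation along with the rest of the model.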
By adding the positional encodings directly to the token embeddings, we create an input representation X that equips the subsequent Transformer layers with both the 'what' (semantics from E) and the 'where' (sequence order from P) information necessary for sophisticated sequence understanding. This combined representation forms the input to the first layer of the encoder or decoder stack.