While the sinusoidal positional encoding method provides a clever, fixed way to inject sequence order information, it's not the only approach. An alternative strategy, adopted by several influential architectures like BERT and the original GPT models, is to learn the positional embeddings directly, much like word embeddings are learned.
The Concept of Learned Positional Embeddings
Instead of relying on predefined mathematical functions (sine and cosine waves), we can treat each position in the sequence as a distinct entity that requires its own vector representation. The core idea is straightforward:
- Define a maximum possible sequence length, let's call it $L_{max}$, that the model is expected to handle. This is typically determined during model design (e.g., 512, 1024, 2048).
- Create a standard embedding matrix, $P$, of shape $(L_{max}, d_{model})$, where $d_{model}$ is the embedding dimension (the same as that of the token embeddings).
- Each row $P_i$ of this matrix represents the learned embedding vector for position $i$ in the sequence, where $0 \le i < L_{max}$.
- These positional embeddings are initialized (often randomly or with some structured initialization) and then trained jointly with the rest of the model parameters via backpropagation, as in the sketch below.
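A minimal PyTorch sketch of these steps follows; the class name `LearnedPositionalEmbedding` and the default values for `max_len` and `d_model` are illustrative assumptions rather than settings from any particular model:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Illustrative sketch: one trainable vector per absolute position."""

    def __init__(self, max_len: int = 512, d_model: int = 768):
        super().__init__()
        # P has shape (L_max, d_model); each row is the embedding for one position.
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, seq_len: int) -> torch.Tensor:
        # Positions 0 .. L-1 are looked up just like token ids.
        positions = torch.arange(seq_len, device=self.position_embeddings.weight.device)
        return self.position_embeddings(positions)  # shape (L, d_model)
```

Because the table is an ordinary `nn.Embedding`, its rows receive gradients and are updated alongside the token embeddings during training.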
Integration with Token Embeddings
The integration process mirrors that of sinusoidal encodings. Given an input sequence of length $L$ (where $L \le L_{max}$), we first obtain the token embeddings $E_{token} \in \mathbb{R}^{L \times d_{model}}$. We then look up the corresponding learned positional embeddings for the first $L$ positions from our matrix $P$, obtaining $E_{position} = [P_0, P_1, \ldots, P_{L-1}] \in \mathbb{R}^{L \times d_{model}}$.
The final input representation fed into the first Transformer layer is typically the sum of the token embeddings and the learned positional embeddings:
$$E_{input} = E_{token} + E_{position}$$
This element-wise addition allows the model to combine the semantic meaning of the token with its absolute position within the sequence.
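A short sketch of this lookup-and-add step, reusing the hypothetical `LearnedPositionalEmbedding` module above and assuming a standard token embedding table with arbitrary example token ids:

```python
# Continuing from the previous sketch.
vocab_size, d_model, max_len = 30_000, 768, 512

token_embedding = nn.Embedding(vocab_size, d_model)
positional_embedding = LearnedPositionalEmbedding(max_len, d_model)

# A batch with one example sequence of length L = 6 (L <= L_max); the ids are arbitrary.
token_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])

e_token = token_embedding(token_ids)                  # (batch, L, d_model)
e_position = positional_embedding(token_ids.size(1))  # (L, d_model), broadcast over the batch
e_input = e_token + e_position                        # summed input to the first Transformer layer
```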
Advantages
- Flexibility: The primary advantage is flexibility. The model isn't constrained by a fixed functional form like sine and cosine. It can theoretically learn positional representations that are optimally suited for the specific task and dataset it's being trained on. If certain positions or positional relationships are particularly important for the task, the model has the capacity to encode this during training.
- Simplicity: The implementation is arguably simpler than calculating sinusoidal values. It leverages the standard embedding lookup mechanism already present in deep learning frameworks.
Disadvantages
- Extrapolation Limitation: Learned embeddings fundamentally struggle with sequences longer than the $L_{max}$ encountered during training. Since there is no learned embedding vector for positions $L_{max}$ and beyond, the model cannot directly process longer sequences without retraining or ad-hoc workarounds. Sinusoidal encodings, being function-based, can generate encodings for arbitrarily long sequences (see the sketch after this list).
- Increased Parameter Count: This method introduces $L_{max} \times d_{model}$ additional trainable parameters (e.g., $L_{max} = 2048$ and $d_{model} = 1024$ adds roughly 2.1 million parameters). For large values of $L_{max}$ and $d_{model}$, this can significantly increase the model size, memory requirements, and potentially the risk of overfitting, especially if training data is limited.
- Data Dependence: Learning effective positional representations from scratch requires sufficient training data. In scenarios with limited data, the fixed structure of sinusoidal encodings might provide a more robust inductive bias, leading to better generalization.
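To make the extrapolation contrast concrete, the sketch below reuses the hypothetical module from earlier and pairs it with a standard sinusoidal formula: a position at or beyond $L_{max}$ has no learned row to look up, while a function-based encoding can simply be computed for it.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> torch.Tensor:
    """Function-based encoding: computable for any position, no lookup table required."""
    enc = torch.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc[i] = math.sin(angle)
        if i + 1 < d_model:
            enc[i + 1] = math.cos(angle)
    return enc

pe = LearnedPositionalEmbedding(max_len=512, d_model=768)

try:
    pe(513)  # needs a row for position 512, but only rows 0..511 were learned
except IndexError as err:
    print("Learned embeddings cannot extrapolate:", err)

print(sinusoidal_encoding(5000, 768)[:4])  # computed on the fly, even far beyond max_len
```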
Choosing Between Learned and Sinusoidal Encodings
The choice between learned and fixed sinusoidal positional encodings often depends on the specific application, model architecture philosophy, and available data.
- Learned Embeddings: Often preferred when the maximum sequence length is well defined and not excessively large, and when sufficient data is available to learn meaningful representations. Architectures like BERT, which are pretrained on massive datasets, commonly use learned embeddings.
- Sinusoidal Encodings: A strong choice when the ability to handle variable or potentially very long sequences is important, or when parameter efficiency is a concern. The original Transformer paper utilized sinusoidal encodings, and they remain a viable and effective option.
Ultimately, both methods serve the same purpose: providing the permutation-invariant self-attention mechanism with the crucial information about the order of elements in the input sequence. The next section presents a comparison of their properties.