Natural language processing (NLP) relies heavily on understanding the sequence and order of words to derive meaning from text. While traditional models like Recurrent Neural Networks (RNNs) naturally handle sequential data by processing one token at a time, Transformers, with their parallel processing capability, inherently lack sensitivity to the sequence of input tokens. This is where positional encoding plays a crucial role, providing the Transformer model with a mechanism to understand the order of words in a sequence.
Positional encoding is a technique used to inject information about the position of tokens in a sequence directly into the input embeddings. This is achieved by adding a set of vectors to the input embeddings, each corresponding to a position in the sequence. Unlike RNNs, which capture positional information through their iterative structure, Transformers leverage these encodings to maintain positional context.
Mathematically, positional encodings can be understood as a set of sinusoidal functions evaluated at each position in the sequence. The formulation is as follows, where $pos$ is the position, $i$ is the dimension index, and $d_{model}$ is the model dimension:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Figure: Visualization of the sinusoidal positional encoding functions for the first 11 positions.
This choice of sinusoidal functions is not arbitrary. It makes relative positions easy to represent: shifting a position by a fixed offset produces a fixed, predictable phase shift in each sine-cosine pair, so the encoding of any position can be obtained from that of another by a linear transformation that depends only on their distance. This property also helps the model generalize to sequences longer than those seen during training.
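To make the phase-shift argument concrete, the standard angle-addition identities can be written out for a single sine-cosine pair of the encoding. Writing $\omega_i = 1/10000^{2i/d_{model}}$ for that pair's frequency (a shorthand introduced only for this sketch), a shift by a fixed offset $k$ acts as a rotation whose matrix depends on $k$ alone:

$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$$

Because the rotation matrix does not involve $pos$, the same fixed linear map relates every pair of positions separated by $k$, which is the sense in which relative positions are easy for the model to pick up.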
Let's explore a brief Python code snippet that demonstrates how to generate positional encodings:
import numpy as np

def positional_encoding(max_len, d_model):
    # Column vector of positions: shape (max_len, 1)
    position = np.arange(max_len)[:, np.newaxis]
    # Inverse frequencies 1 / 10000^(2i / d_model) for the even dimensions: shape (d_model // 2,)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    # Even dimensions receive the sine, odd dimensions the cosine of the same angle
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

# Example usage
max_len = 100
d_model = 512
pos_encoding = positional_encoding(max_len, d_model)
print(pos_encoding.shape)  # Should output (100, 512)
In this snippet, max_len represents the maximum sequence length, and d_model is the dimensionality of the model's input embeddings. The function positional_encoding generates a matrix in which each row is the positional encoding of one position in the sequence.
Integrating these encodings into the Transformer model involves simply adding them to the input embeddings before feeding them into the subsequent layers. This addition is performed element-wise, ensuring that the positional information is seamlessly blended with the semantic information in the embeddings.
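To show how that addition looks in practice, here is a minimal sketch that reuses positional_encoding, d_model, and pos_encoding from the snippet above; the batch of token embeddings is a random placeholder standing in for the output of an embedding layer, and the shapes are chosen only for illustration:

# Placeholder token embeddings standing in for an embedding-layer output
batch_size, seq_len = 2, 10
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(batch_size, seq_len, d_model))

# Slice the precomputed table to the current sequence length and add it element-wise;
# broadcasting applies the same positional pattern to every sequence in the batch
transformer_input = token_embeddings + pos_encoding[np.newaxis, :seq_len, :]
print(transformer_input.shape)  # (2, 10, 512)

In the original Transformer, the token embeddings are additionally scaled by the square root of d_model before this step, but the element-wise addition itself is all that is needed to inject the positional signal.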
The choice of using sinusoidal encodings, as opposed to learned encodings, is motivated by their ability to generalize to inputs longer than those encountered during training. This is particularly useful when dealing with sequences of varying lengths, a common scenario in NLP tasks.
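One quick way to see this property in code, using the same positional_encoding function (the lengths below are arbitrary choices for illustration): each row of the table depends only on its own position index and d_model, never on the table's total length, so a longer table simply appends new rows without disturbing the old ones.

# Tables built for different maximum lengths agree on their shared prefix,
# so the encoding extends to unseen lengths without retraining anything
pe_short = positional_encoding(max_len=128, d_model=512)
pe_long = positional_encoding(max_len=1024, d_model=512)
print(np.allclose(pe_short, pe_long[:128]))  # True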
In summary, positional encoding is a cornerstone of the Transformer architecture, bridging the gap between its parallel processing prowess and the need for sequential understanding. By incorporating positional encodings, Transformers effectively capture the order of tokens, enabling them to achieve remarkable performance across a wide array of NLP tasks. As you continue to explore the intricacies of Transformer models, understanding and leveraging positional encoding will enhance your ability to implement and optimize these powerful systems.