Having explored the theoretical underpinnings of sinusoidal positional encodings and why they are necessary to inform the Transformer about sequence order, let's put this into practice. Understanding how these encodings behave visually can provide valuable intuition. In this section, we will implement the positional encoding function and visualize the resulting vectors.
We will use Python with NumPy for numerical computation and Plotly for interactive visualizations, which are well-suited for web-based course materials.
First, let's translate the mathematical formulas for sinusoidal positional encoding into code. Recall the formulas:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ is the position in the sequence, $i$ indexes each sine/cosine pair of dimensions (so $2i$ and $2i+1$ are dimension indices within the embedding vector), and $d_{model}$ is the dimensionality of the embedding.
Here's a Python function using NumPy to generate these encodings:
import numpy as np

def get_positional_encoding(max_seq_len, d_model):
    """
    Generates sinusoidal positional encodings.

    Args:
        max_seq_len: Maximum sequence length.
        d_model: Dimensionality of the model embedding.

    Returns:
        A numpy array of shape (max_seq_len, d_model) containing
        the positional encodings.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be an even number to accommodate sin/cos pairs.")

    # Initialize the positional encoding matrix
    pos_encoding = np.zeros((max_seq_len, d_model))

    # Create a column vector of positions [0, 1, ..., max_seq_len-1]
    position = np.arange(max_seq_len)[:, np.newaxis]  # Shape: (max_seq_len, 1)

    # Calculate the division term: 1 / (10000^(2i / d_model))
    # Corresponds to i = 0, 1, ..., d_model/2 - 1
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))  # Shape: (d_model/2,)

    # Apply sin to even indices (2i)
    pos_encoding[:, 0::2] = np.sin(position * div_term)

    # Apply cos to odd indices (2i + 1)
    pos_encoding[:, 1::2] = np.cos(position * div_term)

    return pos_encoding

# Example Usage:
max_len = 50   # Maximum sequence length
d_model = 128  # Embedding dimension (must be even)

positional_encodings = get_positional_encoding(max_len, d_model)
print(f"Shape of generated positional encodings: {positional_encodings.shape}")
# Output: Shape of generated positional encodings: (50, 128)
This function takes the maximum sequence length and the model's embedding dimension as input. It calculates the sine values for even indices and the cosine values for odd indices based on the position and the div_term, which represents the frequency component. The result is a matrix in which each row corresponds to a position in the sequence and each column corresponds to a dimension of the positional encoding vector.
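As a quick check on the frequency interpretation of div_term: the sinusoid in dimension pair i has wavelength 2π · 10000^(2i/d_model) with respect to pos, so the wavelengths form a geometric progression from 2π up to just under 2π · 10000. A small sketch (using the same d_model as above):

import numpy as np

d_model = 128
i = np.arange(d_model // 2)
wavelengths = 2 * np.pi * 10000 ** (2 * i / d_model)

print(wavelengths[0])   # ~6.28 (2*pi): fastest-oscillating pair (dimensions 0 and 1)
print(wavelengths[-1])  # ~5.4e4, approaching 2*pi*10000: slowest-oscillating pair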
Visualizing this matrix helps reveal the structure of these encodings. A heatmap is an effective way to see how the encoding values change across positions and dimensions. We'll use the encodings generated above, for a sequence length of 50 and an embedding dimension of 128.
Heatmap visualizing sinusoidal positional encodings for a sequence of length 50 and embedding dimension 128. Each row represents a position, and each column represents a dimension index. Color intensity indicates the encoding value.
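A similar heatmap can be produced with Plotly along the lines of the sketch below; it reuses the positional_encodings matrix from the example above, and the colorscale and layout settings are illustrative choices rather than the exact configuration behind the figure.

import plotly.graph_objects as go

fig = go.Figure(
    data=go.Heatmap(
        z=positional_encodings,  # rows = positions, columns = dimension indices
        colorscale="RdBu",
        zmin=-1.0,
        zmax=1.0,
    )
)
fig.update_layout(
    title="Sinusoidal positional encodings",
    xaxis_title="Dimension index",
    yaxis_title="Position",
)
fig.show()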
From the heatmap, several properties discussed earlier become visually apparent: each row (position) receives a unique pattern of values, dimensions with low indices oscillate rapidly across positions while higher dimensions vary slowly, and every value stays bounded within [-1, 1].
Let's examine this uniqueness further by plotting the encoding vectors for a few specific positions (for example, positions 0, 10, and 25) across all dimensions.
Line plots comparing the 128-dimensional positional encoding vectors for positions 0, 10, and 25. The distinct shape of each line highlights the unique encoding assigned to each sequence position.
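A comparable comparison plot can be generated with Plotly roughly as follows; the trace names and titles are illustrative, and the positional_encodings matrix from the earlier example is reused.

import plotly.graph_objects as go

fig = go.Figure()
for pos in (0, 10, 25):
    fig.add_trace(
        go.Scatter(
            y=positional_encodings[pos],  # the 128-dimensional encoding vector
            mode="lines",
            name=f"position {pos}",
        )
    )
fig.update_layout(
    title="Positional encoding vectors at positions 0, 10, and 25",
    xaxis_title="Dimension index",
    yaxis_title="Encoding value",
)
fig.show()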
These visualizations confirm that sinusoidal positional encodings provide a distinct signal for each position, varying smoothly across dimensions with different frequencies. This positional signal is then added to the input token embeddings, allowing the subsequent self-attention layers to consider the order of elements in the sequence.
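Since the positional signal is combined with the token embeddings by simple addition, a minimal sketch of that step looks like the following; here token_embeddings is a hypothetical stand-in array of random values rather than embeddings from a learned embedding table.

import numpy as np

seq_len, d_model = 50, 128

# Hypothetical token embeddings; in a real model these come from a learned
# embedding table applied to the tokenized input sequence.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))

pos_encoding = get_positional_encoding(seq_len, d_model)
encoder_input = token_embeddings + pos_encoding[:seq_len]  # element-wise addition

print(encoder_input.shape)  # (50, 128)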
In the next chapter, we will assemble these components, along with the multi-head attention mechanism, into the full Transformer encoder and decoder stacks.