Transformers have fundamentally reshaped sequence modeling and beyond, achieving state-of-the-art results in natural language processing, computer vision, and other domains. Their effectiveness stems from the self-attention mechanism, which allows the model to weigh the importance of different input elements when producing an output, irrespective of their distance. This section guides you through constructing a Transformer model from its fundamental components using PyTorch, fostering a deep understanding of its mechanics. We assume familiarity with basic PyTorch modules (nn.Module, nn.Linear, etc.) and deep learning concepts.
The original Transformer, proposed by Vaswani et al. in "Attention Is All You Need" (2017), follows an encoder-decoder structure, suitable for sequence-to-sequence tasks like machine translation.
High-level view of the Encoder-Decoder Transformer architecture. The Encoder processes the input sequence, and the Decoder uses the Encoder's output and the previously generated target sequence to produce the next element.
Many successful models utilize only the encoder stack (e.g., BERT for language understanding) or the decoder stack (e.g., GPT for language generation). We will focus on implementing the core building blocks common to all variants.
Neural networks process numbers, not raw text. Therefore, the first step is to convert input tokens (words, subwords, or characters) into numerical vectors.
This is typically done using an embedding layer, which is essentially a lookup table. Each unique token in the vocabulary is assigned a dense vector of a fixed size, dmodel. In PyTorch, this is straightforwardly implemented using torch.nn.Embedding.
import torch
import torch.nn as nn
import math
# Example parameters
vocab_size = 10000 # Size of the vocabulary
d_model = 512 # Embedding dimension
embedding = nn.Embedding(vocab_size, d_model)
# Example usage: batch of 2 sequences, length 10
input_tokens = torch.randint(0, vocab_size, (2, 10)) # (batch_size, seq_len)
input_embeddings = embedding(input_tokens) # (batch_size, seq_len, d_model)
print("Input shape:", input_tokens.shape)
print("Embedded shape:", input_embeddings.shape)
The self-attention mechanism, which we'll explore next, processes sequence elements simultaneously. It doesn't inherently consider the order or position of tokens. Without positional information, "the cat sat on the mat" would look the same as "the mat sat on the cat" to the attention mechanism after embedding.
To address this, Transformers inject information about the position of each token into its embedding. The original paper proposed using fixed sinusoidal functions:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Here, pos is the position of the token in the sequence, and i is the dimension index within the embedding vector ($0 \le 2i < d_{model}$). Each dimension of the positional encoding corresponds to a sinusoid of a different frequency. This choice allows the model to potentially learn relative positions easily, as $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
Alternatively, learned positional embeddings (similar to token embeddings, but looking up position indices) can be used. We will implement the sinusoidal version here.
class PositionalEncoding(nn.Module):
def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# Create position indices (0 to max_len - 1)
position = torch.arange(max_len).unsqueeze(1) # Shape: (max_len, 1)
# Calculate the division term for sine and cosine arguments
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
# Shape: (d_model / 2)
# Initialize positional encoding matrix
pe = torch.zeros(max_len, d_model) # Shape: (max_len, d_model)
# Apply sin to even indices, cos to odd indices
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# Add a batch dimension and register as a buffer (not a model parameter)
pe = pe.unsqueeze(0) # Shape: (1, max_len, d_model)
self.register_buffer('pe', pe)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: Tensor, shape [batch_size, seq_len, d_model]
Returns:
Tensor, shape [batch_size, seq_len, d_model]
"""
# Add positional encoding to the input embeddings
# self.pe is (1, max_len, d_model). We take the slice up to x's sequence length.
# x is (batch_size, seq_len, d_model)
x = x + self.pe[:, :x.size(1), :]
return self.dropout(x)
# Example Usage
pos_encoder = PositionalEncoding(d_model, dropout=0.1)
final_input = pos_encoder(input_embeddings * math.sqrt(d_model)) # Scale embeddings before adding PE
print("Shape after Positional Encoding:", final_input.shape)
# Note: The original paper scales embeddings by sqrt(d_model) before adding PE.
The final input to the first Transformer layer is the sum of the token embeddings (optionally scaled) and the positional encodings.
Attention mechanisms allow a model to focus on relevant parts of the input sequence when processing a specific element. Self-attention relates different positions of a single sequence to compute a representation of that sequence.
The fundamental building block is Scaled Dot-Product Attention. For each element in the sequence, we compute three vectors: Query (Q), Key (K), and Value (V). These are typically obtained by multiplying the input embedding (plus positional encoding) by learned weight matrices WQ, WK, and WV.
Imagine you are processing the word "making" in the sentence "making Transformer models more interpretable".
The attention score between the query of one word (e.g., "making") and the key of another word (e.g., "Transformer") is computed using a dot product. These scores determine how much attention "making" should pay to "Transformer".
The scores are scaled down by the square root of the key vector dimension (dk) to prevent the dot products from growing too large, which could saturate the softmax function and lead to vanishing gradients. A softmax function then converts these scores into probabilities (weights) that sum to 1.
Finally, the output for the query word ("making") is a weighted sum of all the Value vectors in the sequence, where the weights are the computed probabilities.
The formula is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Q, K, and V are matrices packing the queries, keys, and values for all tokens in the sequence.
Instead of performing a single attention calculation with dmodel-dimensional Q, K, V vectors, Multi-Head Attention projects the input Q, K, V vectors h times (where h is the number of heads) using different learned linear projections (weight matrices) to dimensions dq, dk, dv (typically dq=dk=dv=dmodel/h).
Scaled Dot-Product Attention is then applied independently to each of these projected versions (each "head"). This allows the model to jointly attend to information from different representation subspaces at different positions. It's like asking multiple, different questions (queries) about the input simultaneously.
The outputs from all h heads are concatenated and then passed through a final linear layer (WO) to produce the final output of the multi-head attention layer.
def scaled_dot_product_attention(q, k, v, mask=None):
"""Calculate Scaled Dot Product Attention"""
d_k = q.size(-1) # Get the last dimension (embedding dimension of K)
# Matmul Q and K transpose: (..., seq_len_q, d_k) x (..., seq_len_k, d_k) -> (..., seq_len_q, seq_len_k)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
# Apply mask (if provided) by setting masked positions to a very small number (-1e9)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
# Apply softmax to get attention weights
attn_weights = torch.softmax(scores, dim=-1) # (..., seq_len_q, seq_len_k)
# Matmul weights and V: (..., seq_len_q, seq_len_k) x (..., seq_len_v, d_v) -> (..., seq_len_q, d_v)
# Note: seq_len_k == seq_len_v
output = torch.matmul(attn_weights, v)
return output, attn_weights
class MultiHeadAttention(nn.Module):
def __init__(self, d_model: int, num_heads: int):
super().__init__()
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads # Dimension of keys/queries per head
# Linear layers for Q, K, V projections (applied to all heads)
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
# Final linear layer after concatenation
self.W_o = nn.Linear(d_model, d_model)
def split_heads(self, x: torch.Tensor) -> torch.Tensor:
# Input x: (batch_size, seq_len, d_model)
batch_size, seq_len, _ = x.size()
# Reshape to (batch_size, seq_len, num_heads, d_k)
x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
# Transpose to (batch_size, num_heads, seq_len, d_k) for attention calculation
return x.transpose(1, 2)
def combine_heads(self, x: torch.Tensor) -> torch.Tensor:
# Input x: (batch_size, num_heads, seq_len, d_k)
batch_size, _, seq_len, _ = x.size()
# Transpose back to (batch_size, seq_len, num_heads, d_k)
x = x.transpose(1, 2).contiguous() # Ensure memory is contiguous after transpose
# Reshape to (batch_size, seq_len, d_model)
return x.view(batch_size, seq_len, self.d_model)
def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
# q, k, v: (batch_size, seq_len, d_model)
# mask: (batch_size, 1, seq_len_q, seq_len_k) or similar broadcastable shape
# 1. Apply linear projections
q = self.W_q(q) # (batch_size, seq_len_q, d_model)
k = self.W_k(k) # (batch_size, seq_len_k, d_model)
v = self.W_v(v) # (batch_size, seq_len_v, d_model) (seq_len_k == seq_len_v)
# 2. Split into multiple heads
q = self.split_heads(q) # (batch_size, num_heads, seq_len_q, d_k)
k = self.split_heads(k) # (batch_size, num_heads, seq_len_k, d_k)
v = self.split_heads(v) # (batch_size, num_heads, seq_len_v, d_k)
# 3. Apply scaled dot-product attention
# output: (batch_size, num_heads, seq_len_q, d_k)
# attn_weights: (batch_size, num_heads, seq_len_q, seq_len_k)
attention_output, attn_weights = scaled_dot_product_attention(q, k, v, mask)
# 4. Combine heads
output = self.combine_heads(attention_output) # (batch_size, seq_len_q, d_model)
# 5. Final linear layer
output = self.W_o(output) # (batch_size, seq_len_q, d_model)
return output # We usually only need the output, not the weights, for the next layer
# Example Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
# In self-attention, Q, K, and V are typically the same tensor initially
query = key = value = final_input # Shape: (batch_size, seq_len, d_model)
attention_result = mha(query, key, value, mask=None) # mask is important for padding/decoding
print("Multi-Head Attention output shape:", attention_result.shape)
Masking is essential in two scenarios:
1. Padding mask: sequences in a batch are padded to a common length, and attention should ignore the padded positions. This mask typically has shape (batch_size, 1, 1, seq_len_k) and contains 0s for padded positions and 1s otherwise.
2. Look-ahead (causal) mask: during decoding, a position must not attend to future positions in the target sequence. This is done by masking out (setting to -1e9 before softmax) the upper triangle of the attention score matrix.
A small sketch of both mask types is shown below.
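The sketch below builds both mask types with the shape conventions used in this section; the toy batch and the choice of 0 as the pad token id are illustrative assumptions.
# Toy batch (assumption: pad token id is 0)
pad_token_idx = 0
batch = torch.tensor([[5, 7, 2, 0, 0],
                      [3, 9, 4, 8, 0]])  # (batch_size=2, seq_len=5)
# Padding mask: True (1) for real tokens, False (0) for padded positions
padding_mask = (batch != pad_token_idx).unsqueeze(1).unsqueeze(2)  # (2, 1, 1, 5)
# Look-ahead (causal) mask: True on/below the diagonal, False for future positions
seq_len = batch.size(1)
look_ahead = ~torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
look_ahead = look_ahead.unsqueeze(0).unsqueeze(0)  # (1, 1, 5, 5)
# Both work with scaled_dot_product_attention above, which masks positions where mask == 0
print(padding_mask.shape, look_ahead.shape)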
Each sub-layer in the Transformer (like Multi-Head Attention or the Feed-Forward Network) has a residual connection around it, followed by layer normalization.
The output of the sub-layer is added to the input of the sub-layer: output = x + Sublayer(x)
. This technique, borrowed from residual networks (ResNets), helps mitigate the vanishing gradient problem in deep networks, allowing gradients to flow more directly through the network during backpropagation. It also enables the training of much deeper models.
Layer Normalization (nn.LayerNorm) normalizes the activations across the features (the dmodel dimension) for each individual data sample (token) in the batch, independently. This contrasts with Batch Normalization, which normalizes across the batch dimension. Layer Normalization is generally preferred in NLP and Transformers because it does not depend on batch statistics: it behaves the same for any batch size (including a single example at inference), handles variable-length sequences naturally, and needs no running statistics.
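As a quick illustration (a sketch reusing the input_embeddings tensor from above), nn.LayerNorm normalizes each token vector across its d_model features, so every position ends up with roughly zero mean and unit variance regardless of the rest of the batch:
layer_norm = nn.LayerNorm(d_model)
normed = layer_norm(input_embeddings)  # (batch_size, seq_len, d_model)
# Each token vector is normalized independently over its features
print(normed.mean(dim=-1)[0, :3])  # approximately 0 at every position
print(normed.std(dim=-1)[0, :3])   # approximately 1 at every position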
The Add & Norm step is typically implemented as output = LayerNorm(x + Dropout(Sublayer(x))). Dropout is often applied to the output of the sub-layer before the residual addition and normalization.
class AddNorm(nn.Module):
def __init__(self, normalized_shape: int, dropout: float):
super().__init__()
self.layer_norm = nn.LayerNorm(normalized_shape)
self.dropout = nn.Dropout(dropout)
def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
# Apply residual connection and dropout, then layer normalization
return self.layer_norm(x + self.dropout(sublayer_output))
# Example: Applying Add & Norm after Multi-Head Attention
dropout_rate = 0.1
add_norm1 = AddNorm(d_model, dropout_rate)
# 'final_input' is the input to the MHA layer
normed_attention_output = add_norm1(final_input, attention_result)
print("Add & Norm output shape:", normed_attention_output.shape)
Following the attention sub-layer (and its Add & Norm), each position's representation is passed through the same feed-forward network (FFN), applied to each position independently. This network typically consists of two linear transformations with a non-linear activation in between, usually ReLU or GELU (Gaussian Error Linear Unit).
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 \quad \text{(using ReLU)}$$
The dimensionality usually increases in the first linear layer (e.g., to $d_{ff} = 4 \times d_{model}$) and then decreases back to $d_{model}$ in the second layer. This FFN allows the model to process the information learned via attention at each position independently, adding non-linear modeling capacity.
class PositionWiseFeedForward(nn.Module):
def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.activation = nn.ReLU() # Or nn.GELU()
self.dropout = nn.Dropout(dropout)
self.linear2 = nn.Linear(d_ff, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# x: (batch_size, seq_len, d_model)
x = self.linear1(x) # (batch_size, seq_len, d_ff)
x = self.activation(x)
x = self.dropout(x)
x = self.linear2(x) # (batch_size, seq_len, d_model)
return x
# Example Usage
d_ff = d_model * 4 # Common practice
ffn = PositionWiseFeedForward(d_model, d_ff, dropout_rate)
ffn_output = ffn(normed_attention_output)
# Apply the second Add & Norm layer
add_norm2 = AddNorm(d_model, dropout_rate)
# 'normed_attention_output' was the input to the FFN
encoder_layer_output = add_norm2(normed_attention_output, ffn_output)
print("FFN output shape:", ffn_output.shape)
print("Encoder Layer output shape:", encoder_layer_output.shape)
With these components, we can define a complete Encoder Layer and Decoder Layer.
An Encoder Layer consists of:
1. A multi-head self-attention sub-layer, followed by Add & Norm.
2. A position-wise feed-forward sub-layer, followed by Add & Norm.
class EncoderLayer(nn.Module):
def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float):
super().__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.add_norm1 = AddNorm(d_model, dropout)
self.ffn = PositionWiseFeedForward(d_model, d_ff, dropout)
self.add_norm2 = AddNorm(d_model, dropout)
def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
# Self-attention sublayer
attn_output = self.self_attn(q=x, k=x, v=x, mask=mask)
x = self.add_norm1(x, attn_output) # Residual connection + norm
# Feed-forward sublayer
ffn_output = self.ffn(x)
x = self.add_norm2(x, ffn_output) # Residual connection + norm
return x
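A quick shape check for the assembled layer, reusing the hyperparameters and the final_input tensor from earlier (a sketch; mask=None stands in for a real padding mask):
# Example Usage
encoder_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
encoder_layer_out = encoder_layer(final_input, mask=None)
print("EncoderLayer output shape:", encoder_layer_out.shape)  # (batch_size, seq_len, d_model)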
A Decoder Layer is slightly more complex, containing two attention mechanisms:
1. Masked multi-head self-attention over the decoder's own input, using a look-ahead mask so each position can only attend to itself and earlier positions, followed by Add & Norm.
2. Encoder-decoder (cross) attention over the encoder output (often called the memory). Queries (Q) come from the output of the previous decoder sub-layer, while Keys (K) and Values (V) come from the encoder output. This allows the decoder to consider the relevant parts of the input sequence when generating the output sequence. A padding mask from the encoder input might be needed here. This sub-layer is also followed by Add & Norm.
These are followed by the position-wise feed-forward sub-layer with a final Add & Norm.
class DecoderLayer(nn.Module):
def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float):
super().__init__()
self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
self.add_norm1 = AddNorm(d_model, dropout)
self.encoder_decoder_attn = MultiHeadAttention(d_model, num_heads)
self.add_norm2 = AddNorm(d_model, dropout)
self.ffn = PositionWiseFeedForward(d_model, d_ff, dropout)
self.add_norm3 = AddNorm(d_model, dropout)
def forward(self, x: torch.Tensor, encoder_output: torch.Tensor,
look_ahead_mask: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
# 1. Masked Self-Attention sublayer
# Q=x, K=x, V=x; use look_ahead_mask
self_attn_output = self.masked_self_attn(q=x, k=x, v=x, mask=look_ahead_mask)
x = self.add_norm1(x, self_attn_output)
# 2. Encoder-Decoder Attention sublayer
# Q=x (from previous layer), K=encoder_output, V=encoder_output
# Use padding_mask relevant to the encoder_output
enc_dec_attn_output = self.encoder_decoder_attn(q=x, k=encoder_output, v=encoder_output, mask=padding_mask)
x = self.add_norm2(x, enc_dec_attn_output)
# 3. Feed-Forward sublayer
ffn_output = self.ffn(x)
x = self.add_norm3(x, ffn_output)
return x
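A shape check for the decoder layer (a sketch that reuses final_input as a stand-in target embedding and encoder_layer_output from the earlier Add & Norm example; the padding mask is omitted for brevity):
# Example Usage
decoder_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
tgt_len = final_input.size(1)
causal_mask = ~torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)  # (tgt_len, tgt_len)
decoder_layer_out = decoder_layer(final_input, encoder_layer_output,
                                  look_ahead_mask=causal_mask, padding_mask=None)
print("DecoderLayer output shape:", decoder_layer_out.shape)  # (batch_size, seq_len, d_model)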
The final Transformer model stacks multiple Encoder Layers (e.g., N=6) to form the Encoder and multiple Decoder Layers (e.g., N=6) to form the Decoder. nn.ModuleList is convenient for this.
class Transformer(nn.Module):
def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
d_model: int, num_heads: int, d_ff: int,
input_vocab_size: int, target_vocab_size: int,
max_seq_len: int, dropout: float = 0.1):
super().__init__()
self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model, dropout, max_seq_len)
self.encoder_layers = nn.ModuleList([
EncoderLayer(d_model, num_heads, d_ff, dropout)
for _ in range(num_encoder_layers)
])
self.decoder_layers = nn.ModuleList([
DecoderLayer(d_model, num_heads, d_ff, dropout)
for _ in range(num_decoder_layers)
])
self.final_linear = nn.Linear(d_model, target_vocab_size)
self.d_model = d_model
self.dropout = nn.Dropout(dropout)
def create_padding_mask(self, seq: torch.Tensor, pad_token_idx: int = 0) -> torch.Tensor:
# seq shape: (batch_size, seq_len)
# Output mask shape: (batch_size, 1, 1, seq_len)
mask = (seq != pad_token_idx).unsqueeze(1).unsqueeze(2)
return mask
def create_look_ahead_mask(self, size: int) -> torch.Tensor:
# Creates an upper triangular matrix for masking future tokens
# Output mask shape: (1, 1, size, size)
        mask = torch.triu(torch.ones(size, size), diagonal=1).bool()  # True above the diagonal (future positions)
        # scaled_dot_product_attention masks positions where the mask is 0 (False),
        # so invert: positions that may be attended to become True, future positions become False.
        return ~mask.unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, size, size), False where we mask
def encode(self, src: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
# src: (batch_size, src_seq_len)
# src_mask: (batch_size, 1, 1, src_seq_len)
src_emb = self.encoder_embedding(src) * math.sqrt(self.d_model)
src_pos_emb = self.positional_encoding(src_emb)
enc_output = self.dropout(src_pos_emb)
for layer in self.encoder_layers:
enc_output = layer(enc_output, src_mask)
return enc_output # (batch_size, src_seq_len, d_model)
def decode(self, tgt: torch.Tensor, encoder_output: torch.Tensor,
look_ahead_mask: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
# tgt: (batch_size, tgt_seq_len)
# encoder_output: (batch_size, src_seq_len, d_model)
# look_ahead_mask: (batch_size, 1, tgt_seq_len, tgt_seq_len)
# padding_mask: (batch_size, 1, 1, src_seq_len) # Used in enc-dec attention
tgt_emb = self.decoder_embedding(tgt) * math.sqrt(self.d_model)
tgt_pos_emb = self.positional_encoding(tgt_emb)
dec_output = self.dropout(tgt_pos_emb)
for layer in self.decoder_layers:
dec_output = layer(dec_output, encoder_output, look_ahead_mask, padding_mask)
return dec_output # (batch_size, tgt_seq_len, d_model)
def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
# src: (batch_size, src_seq_len)
# tgt: (batch_size, tgt_seq_len) usually shifted right for training
src_padding_mask = self.create_padding_mask(src)
tgt_padding_mask = self.create_padding_mask(tgt) # Also needed if tgt has padding
look_ahead_mask = self.create_look_ahead_mask(tgt.size(1)).to(tgt.device)
        # Combine the look-ahead mask and the target padding mask for decoder self-attention.
        # tgt_padding_mask: (batch_size, 1, 1, tgt_seq_len); look_ahead_mask: (1, 1, tgt_seq_len, tgt_seq_len).
        # Broadcasting the logical AND gives (batch_size, 1, tgt_seq_len, tgt_seq_len),
        # masking both future positions and padded target tokens.
        combined_look_ahead_mask = torch.logical_and(tgt_padding_mask, look_ahead_mask)
encoder_output = self.encode(src, src_padding_mask)
decoder_output = self.decode(tgt, encoder_output, combined_look_ahead_mask, src_padding_mask)
# Final linear projection
output = self.final_linear(decoder_output) # (batch_size, tgt_seq_len, target_vocab_size)
return output # Often followed by Softmax outside the model during inference/loss calc
# Example Instantiation (parameters are illustrative)
transformer_model = Transformer(
num_encoder_layers=6, num_decoder_layers=6,
d_model=512, num_heads=8, d_ff=2048,
input_vocab_size=10000, target_vocab_size=12000,
max_seq_len=500, dropout=0.1
)
# Dummy input for shape check (assuming batch_size=2)
src_dummy = torch.randint(1, 10000, (2, 100)) # (batch, src_len)
tgt_dummy = torch.randint(1, 12000, (2, 120)) # (batch, tgt_len) - e.g. shifted target
# Move model and data to GPU if available
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# transformer_model.to(device)
# src_dummy = src_dummy.to(device)
# tgt_dummy = tgt_dummy.to(device)
output_logits = transformer_model(src_dummy, tgt_dummy)
print("Final output shape (logits):", output_logits.shape) # Should be (2, 120, 12000)
By assembling the Transformer from these fundamental PyTorch modules, you gain a concrete understanding of how information flows through the model and how attention mechanisms enable context-aware sequence processing. This component-based implementation also provides a flexible foundation for experimenting with architectural variations or adapting the model to different tasks. Remember that training such models effectively requires careful consideration of optimization, regularization, and data handling, topics covered in subsequent chapters.