Okay, let's translate the theory of the Transformer encoder layer into practice. In the preceding sections, we dissected the components: multi-head self-attention for capturing contextual relationships, position-wise feed-forward networks for transforming representations, and the essential Add & Norm steps for stabilization and gradient flow. Now, we'll assemble these pieces into a functional encoder layer using Python and a deep learning framework like PyTorch or TensorFlow.
This practical exercise assumes you have access to an implementation of Multi-Head Attention (perhaps built in the previous chapter's practical section) and understand basic class definition and tensor operations within your chosen framework. We'll focus on structuring the EncoderLayer module itself.
Before we write the code, let's quickly recall the sub-layers within a single encoder layer:

1. Multi-Head Self-Attention, followed by a residual connection and layer normalization (Add & Norm).
2. Position-wise Feed-Forward Network, also followed by a residual connection and layer normalization (Add & Norm).

Dropout is also typically applied after the multi-head attention output and after the FFN output to prevent overfitting during training.
The position-wise feed-forward network is a straightforward component. It consists of two linear transformations with a non-linear activation function in between. Typically, the first linear layer expands the dimension and the second projects it back to the original model dimension (d_model). A common expansion factor is 4 (for example, d_ff = 2048 for d_model = 512 in the original Transformer).
Let's represent this conceptually (using PyTorch-like syntax):
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        Initializes the Position-wise Feed-Forward Network.

        Args:
            d_model (int): Dimensionality of the input and output.
            d_ff (int): Dimensionality of the inner layer.
            dropout (float): Dropout probability.
        """
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()  # Or nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        """
        Forward pass through the FFN.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, seq_len, d_model).
        """
        x = self.linear_1(x)
        x = self.activation(x)
        x = self.dropout(x)  # Dropout is often applied after the activation
        x = self.linear_2(x)
        return x
This PositionWiseFeedForward module takes a tensor of shape (batch_size, seq_len, d_model) and returns a tensor of the same shape, having applied the transformations independently at each sequence position.
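As a quick sanity check, here is a small usage sketch. The sizes (d_model=512, d_ff=2048) match the original Transformer paper but are purely illustrative:

ffn = PositionWiseFeedForward(d_model=512, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)   # (batch_size=2, seq_len=10, d_model=512)
out = ffn(x)
print(out.shape)              # torch.Size([2, 10, 512]), same shape as the input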
Now we combine the Multi-Head Self-Attention (which we'll assume is available as a module named MultiHeadAttention), the PositionWiseFeedForward network defined above, Layer Normalization, residual connections, and dropout into a single EncoderLayer.
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initializes a single Transformer Encoder Layer.

        Args:
            d_model (int): The dimensionality of the input/output features (embeddings).
            num_heads (int): The number of attention heads.
            d_ff (int): The inner dimension of the feed-forward network.
            dropout (float): The dropout probability.
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)  # Assumed implementation
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Forward pass through the Encoder Layer.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            mask (torch.Tensor, optional): Attention mask for padding.
                Shape is typically (batch_size, 1, 1, seq_len) or similar,
                depending on the MultiHeadAttention implementation.

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, seq_len, d_model).
        """
        # 1. Multi-Head Self-Attention + Add & Norm
        attn_output = self.self_attn(query=x, key=x, value=x, mask=mask)
        # Residual connection 1: apply dropout to the attention output,
        # add it back to the input 'x', then normalize.
        x = self.norm1(x + self.dropout1(attn_output))

        # 2. Position-wise Feed-Forward Network + Add & Norm
        ff_output = self.feed_forward(x)
        # Residual connection 2: apply dropout to the FFN output,
        # add it back to the FFN input 'x', then normalize.
        x = self.norm2(x + self.dropout2(ff_output))

        return x
In this EncoderLayer, the input x first goes through the multi-head self-attention mechanism. The attention output is then passed through dropout (dropout1), added back to the original input x (the first residual connection), and the sum is normalized (norm1). This normalized output serves as the input to the position-wise feed-forward network. The FFN output goes through dropout (dropout2), is added to the input that went into the FFN (the second residual connection), and this sum is normalized (norm2) to produce the final output of the encoder layer.
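To see the layer in action, the following sketch assumes the MultiHeadAttention module from the previous chapter is in scope and accepts query, key, value, and mask keyword arguments, as used above. The sizes and mask shape are illustrative only:

layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)      # (batch_size=2, seq_len=10, d_model=512)
# Padding mask: 1 for real tokens, 0 for padded positions, shaped to broadcast
# across heads and query positions (adjust to your MultiHeadAttention's convention).
mask = torch.ones(2, 1, 1, 10)
out = layer(x, mask)
print(out.shape)                 # torch.Size([2, 10, 512])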
The following diagram illustrates the data flow within the EncoderLayer we just defined.
Data flow within a single Transformer Encoder Layer, showing the Multi-Head Attention and Feed-Forward sub-layers, each followed by Dropout, a residual connection (Add), and Layer Normalization.
This structure, repeated N times (N=6 in the original Transformer paper; encoder-only models such as BERT-base use 12 layers), forms the complete Encoder stack. Each layer takes the output of the previous layer as its input, allowing the model to build increasingly complex representations of the input sequence.
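A minimal sketch of such a stack, built from the EncoderLayer above, could look like this (the Encoder class name and num_layers parameter are our own choices for illustration):

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # num_layers identical encoder layers, applied in sequence.
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        # Each layer consumes the output of the previous one.
        for layer in self.layers:
            x = layer(x, mask)
        return x

In a full model, this stack would sit on top of the token embedding and positional encoding layers.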
You now have a practical implementation of a core Transformer component. In the next chapter, we'll look at how to train these models effectively.