We've examined the individual components of the Transformer encoder: multi-head self-attention for capturing contextual relationships, position-wise feed-forward networks for non-linear transformation, residual connections for gradient flow, and layer normalization for stabilizing activations. Now, let's assemble these parts into a functional EncoderBlock. This practical exercise demonstrates how these sub-layers interact within a single encoder layer, forming the repeatable unit that is stacked to create the full encoder.
Our goal is to implement a standard Transformer encoder block as a module in a deep learning framework such as PyTorch or TensorFlow; for this example, we'll use PyTorch.
Recall the data flow within a single encoder block: the input first passes through multi-head self-attention, the result is added back to the input via a residual connection and layer-normalized, and the same add-and-normalize pattern then repeats around the position-wise feed-forward network. Dropout is typically applied after the self-attention and feed-forward sub-layers, before the residual connection and normalization.

Figure: Data flow within a standard Transformer Encoder Block (Post-LN variant). Dashed lines indicate residual connections.
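Expressed as code, the Post-LN pattern reduces to two lines per block. This is a minimal sketch where attn, ffn, norm1, norm2, and dropout stand in for the modules we build below; the names are placeholders, not a final API:

# Sub-layer 1: self-attention, then dropout, residual addition, and layer norm
x = norm1(x + dropout(attn(x, x, x)))
# Sub-layer 2: feed-forward, then dropout, residual addition, and layer norm
x = norm2(x + dropout(ffn(x)))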
Let's define a PyTorch nn.Module for the EncoderBlock. We assume you have implementations of MultiHeadAttention (as developed in Chapter 3) and PositionwiseFeedForward (discussed earlier in this chapter) available.
import torch
import torch.nn as nn
# Assume MultiHeadAttention and PositionwiseFeedForward classes are defined elsewhere
# class MultiHeadAttention(nn.Module): ...
# class PositionwiseFeedForward(nn.Module): ...
class EncoderBlock(nn.Module):
"""
Implements a single Transformer Encoder Block.
This block follows the structure described in "Attention Is All You Need":
Input -> Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm -> Output
Uses Post-Layer Normalization (Add first, then Norm).
"""
def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout_prob: float = 0.1):
"""
Args:
d_model: The dimensionality of the input and output embeddings (model dimension).
num_heads: The number of attention heads.
d_ff: The inner dimension of the feed-forward network.
dropout_prob: The dropout probability applied after attention and FFN.
"""
super().__init__()
if d_model % num_heads != 0:
raise ValueError(f"'d_model' ({d_model}) must be divisible by 'num_heads' ({num_heads})")
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout_prob)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
"""
Passes the input through the encoder block.
Args:
x: Input tensor of shape (batch_size, seq_len, d_model).
mask: Optional mask for the self-attention layer. Typically used for padding.
Shape should be broadcastable to (batch_size, num_heads, seq_len, seq_len).
Returns:
Output tensor of shape (batch_size, seq_len, d_model).
"""
# 1. Multi-Head Self-Attention + Residual + LayerNorm
# Calculate attention output. Q=K=V=x for self-attention.
attn_output = self.self_attn(x, x, x, mask)
# Apply dropout to the attention output, then add residual connection (input x),
# and finally apply layer normalization.
x = self.norm1(x + self.dropout(attn_output))
# Store the output of the first sub-layer (attention + add&norm)
# This will be the input to the second residual connection.
sublayer1_output = x
# 2. Feed-Forward Network + Residual + LayerNorm
# Calculate feed-forward output
ff_output = self.feed_forward(sublayer1_output)
# Apply dropout to the FFN output, then add residual connection
# (using the output from the first sub-layer), and apply layer normalization.
x = self.norm2(sublayer1_output + self.dropout(ff_output))
return x
Initialization (__init__): We instantiate the necessary sub-modules: MultiHeadAttention, PositionwiseFeedForward, two LayerNorm layers, and a Dropout layer. The LayerNorm layers normalize over the feature dimension (d_model). A check ensures d_model is divisible by num_heads for the multi-head attention mechanism.

Forward pass (forward):

1. Self-attention: The input x serves as Query, Key, and Value. The output attn_output is regularized using dropout.
2. First Add & Norm: We add the original input x to the (dropout-modified) attention output. This sum is then passed through the first layer normalization (self.norm1). The result updates the variable x.
3. Feed-forward: The output of the first sub-layer (x) is passed through the feed-forward network (self.feed_forward). The result (ff_output) is regularized using dropout.
4. Second Add & Norm: We add the residual (the output of self.norm1, stored temporarily as sublayer1_output in the refined code) to the (dropout-modified) feed-forward output. This sum is passed through the second layer normalization (self.norm2).

Here's how you might create and use an EncoderBlock, including placeholder definitions for the sub-modules to make the example runnable:
# Example parameters
batch_size = 4
seq_len = 50
d_model = 512 # Model dimension
num_heads = 8 # Number of attention heads
d_ff = 2048 # Feed-forward inner dimension
dropout_prob = 0.1
# Create dummy input tensor (batch_size, seq_len, d_model)
dummy_input = torch.rand(batch_size, seq_len, d_model)
# --- Assume MultiHeadAttention & PositionwiseFeedForward are defined ---
# This part is just to make the example runnable stand-alone.
# Replace with your actual implementations from previous chapters/sections.
class MultiHeadAttention(nn.Module):
# Placeholder implementation
def __init__(self, d_model, num_heads):
super().__init__()
self.d_model = d_model
self.num_heads = num_heads
if d_model % num_heads != 0:
raise ValueError("d_model must be divisible by num_heads")
self.head_dim = d_model // num_heads
# Normally, these would be projections for Q, K, V
self.fc_q = nn.Linear(d_model, d_model)
self.fc_k = nn.Linear(d_model, d_model)
self.fc_v = nn.Linear(d_model, d_model)
self.fc_out = nn.Linear(d_model, d_model) # Final output projection
def forward(self, query, key, value, mask=None):
# Simplified placeholder: project input, apply linear output layer
# In a real implementation, this performs scaled dot-product attention
# across multiple heads and concatenates results.
batch_size = query.shape[0]
# Just simulate the final output projection for shape consistency
# This ignores actual attention calculation for simplicity here
projected_q = self.fc_q(query) # Example projection
return self.fc_out(projected_q) # Return shape (batch_size, seq_len, d_model)
class PositionwiseFeedForward(nn.Module):
# Standard implementation
def __init__(self, d_model, d_ff, dropout=0.1):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(dropout) # Dropout often applied here too
self.linear2 = nn.Linear(d_ff, d_model)
def forward(self, x):
# (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
return self.linear2(self.dropout(self.relu(self.linear1(x))))
# ------------------------------------------------------------------------
# Instantiate the Encoder Block using the (placeholder) modules
encoder_block = EncoderBlock(d_model, num_heads, d_ff, dropout_prob)
# Pass the input through the block
output = encoder_block(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
# Verify shapes
# Expected output:
# Input shape: torch.Size([4, 50, 512])
# Output shape: torch.Size([4, 50, 512])
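The example above omits the optional mask argument. How the mask is consumed depends on your MultiHeadAttention implementation, but assuming it expects a boolean tensor broadcastable to (batch_size, num_heads, seq_len, seq_len) with True marking positions that may be attended to, a padding mask could be built from token IDs like this (pad_token_id and the random token IDs are purely illustrative):

# Hypothetical batch of token IDs, with ID 0 assumed to be the padding token
pad_token_id = 0
token_ids = torch.randint(1, 1000, (batch_size, seq_len))
token_ids[:, 40:] = pad_token_id  # pretend the last 10 positions of each sequence are padding

# Boolean mask: True where a real token is present.
# Shape (batch_size, 1, 1, seq_len) broadcasts over heads and query positions.
padding_mask = (token_ids != pad_token_id).unsqueeze(1).unsqueeze(2)

masked_output = encoder_block(dummy_input, mask=padding_mask)
print(f"Masked output shape: {masked_output.shape}")  # torch.Size([4, 50, 512])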
The output tensor retains the shape (batch_size, seq_len, d_model), which is essential: it allows the output of one EncoderBlock to be fed directly as input into the next EncoderBlock in the stack, enabling the construction of deep encoder models.
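To make that concrete, here is a minimal sketch of a full encoder stack built from the block above. The Encoder class name and its signature are illustrative, and token embedding plus positional encoding are omitted:

class Encoder(nn.Module):
    # A stack of identical encoder blocks sharing the same hyperparameters.
    def __init__(self, num_layers: int, d_model: int, num_heads: int,
                 d_ff: int, dropout_prob: float = 0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout_prob)
            for _ in range(num_layers)
        ])

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Each block maps (batch, seq_len, d_model) to (batch, seq_len, d_model),
        # so the blocks can simply be applied in sequence.
        for layer in self.layers:
            x = layer(x, mask)
        return x

encoder = Encoder(num_layers=6, d_model=d_model, num_heads=num_heads, d_ff=d_ff)
stacked_output = encoder(dummy_input)
print(f"Stacked encoder output shape: {stacked_output.shape}")  # torch.Size([4, 50, 512])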
The choice of d_model, num_heads, d_ff, and dropout_prob significantly affects the model's capacity, computational cost, and generalization ability. The original Transformer used d_model=512, num_heads=8, d_ff=2048, and dropout_prob=0.1; larger models often use larger values.

This block applies Post-Layer Normalization (residual addition followed by LayerNorm), as reflected in the forward method structure. We will discuss the Pre-LN vs. Post-LN trade-offs in Chapter 6.

This hands-on example provides a concrete implementation of the encoder block, combining the theoretical components discussed earlier. By understanding how to build this fundamental unit, you are well-equipped to construct the entire encoder stack and appreciate the data transformations occurring within the Transformer architecture.
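Finally, as a preview of the Pre-LN vs. Post-LN discussion in Chapter 6, here is a minimal sketch of how the forward method would change under Pre-LN ordering, where each sub-layer's input is normalized before the sub-layer rather than after the residual addition. This is an illustrative variant, not the implementation used in this section:

def forward_pre_ln(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
    # Pre-LN: normalize the sub-layer input, apply the sub-layer, then add the residual.
    normed = self.norm1(x)
    x = x + self.dropout(self.self_attn(normed, normed, normed, mask))
    x = x + self.dropout(self.feed_forward(self.norm2(x)))
    # Pre-LN stacks typically apply one additional LayerNorm after the final block.
    return x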