The Transformer encoder is built from several components: multi-head self-attention for capturing contextual relationships, position-wise feed-forward networks for non-linear transformation, residual connections for gradient flow, and layer normalization for stabilizing activations. In this section, these parts are assembled into a functional EncoderBlock, demonstrating how the sub-layers interact within a single encoder layer and forming the repeatable unit that is stacked to create the full encoder.
Our goal is to implement a standard Transformer encoder block module, typically using a framework like PyTorch or TensorFlow. For this example, we'll use PyTorch syntax.
Recall the data flow within a single encoder block: the input passes through multi-head self-attention followed by an Add & Norm step, and then through the position-wise feed-forward network followed by a second Add & Norm step. Dropout is typically applied to the output of the self-attention and feed-forward sub-layers, before the residual connection and normalization.
Figure: Data flow within a standard Transformer Encoder Block (Post-LN variant). Dashed lines indicate residual connections.
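Both Add & Norm steps wrap their sub-layer in the same way. As a minimal sketch of this Post-LN pattern (the helper name post_ln_sublayer is illustrative and not part of the reference implementation):

import torch
import torch.nn as nn

def post_ln_sublayer(x: torch.Tensor, sublayer: nn.Module,
                     norm: nn.LayerNorm, dropout: nn.Dropout) -> torch.Tensor:
    # Apply the sub-layer (self-attention or the feed-forward network),
    # regularize its output with dropout, add the residual input x,
    # and normalize the sum. This is the Post-LN ordering.
    return norm(x + dropout(sublayer(x)))

The encoder block below applies this pattern twice: first with multi-head self-attention as the sub-layer, then with the position-wise feed-forward network.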
Let's define a PyTorch nn.Module for the EncoderBlock. We assume you have implementations for MultiHeadAttention (as developed in Chapter 3) and PositionwiseFeedForward (discussed earlier in this chapter) available.
import torch
import torch.nn as nn

# Assume MultiHeadAttention and PositionwiseFeedForward classes are defined elsewhere
# class MultiHeadAttention(nn.Module): ...
# class PositionwiseFeedForward(nn.Module): ...


class EncoderBlock(nn.Module):
    """
    Implements a single Transformer Encoder Block.

    This block follows the structure described in "Attention Is All You Need":
    Input -> Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm -> Output

    Uses Post-Layer Normalization (Add first, then Norm).
    """

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout_prob: float = 0.1):
        """
        Args:
            d_model: The dimensionality of the input and output embeddings (model dimension).
            num_heads: The number of attention heads.
            d_ff: The inner dimension of the feed-forward network.
            dropout_prob: The dropout probability applied after attention and FFN.
        """
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError(f"'d_model' ({d_model}) must be divisible by 'num_heads' ({num_heads})")

        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        """
        Passes the input through the encoder block.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model).
            mask: Optional mask for the self-attention layer. Typically used for padding.
                  Shape should be broadcastable to (batch_size, num_heads, seq_len, seq_len).

        Returns:
            Output tensor of shape (batch_size, seq_len, d_model).
        """
        # 1. Multi-Head Self-Attention + Residual + LayerNorm
        # Calculate attention output. Q=K=V=x for self-attention.
        attn_output = self.self_attn(x, x, x, mask)

        # Apply dropout to the attention output, then add the residual connection (input x),
        # and finally apply layer normalization.
        x = self.norm1(x + self.dropout(attn_output))

        # Store the output of the first sub-layer (attention + Add & Norm).
        # This will be the input to the second residual connection.
        sublayer1_output = x

        # 2. Feed-Forward Network + Residual + LayerNorm
        # Calculate feed-forward output
        ff_output = self.feed_forward(sublayer1_output)

        # Apply dropout to the FFN output, then add the residual connection
        # (using the output from the first sub-layer), and apply layer normalization.
        x = self.norm2(sublayer1_output + self.dropout(ff_output))

        return x
Initialization (__init__): We instantiate the necessary sub-modules: MultiHeadAttention, PositionwiseFeedForward, two LayerNorm layers, and a Dropout layer. The LayerNorm layers normalize over the feature dimension (d_model). A check ensures d_model is divisible by num_heads for the multi-head attention mechanism.
Forward pass (forward):
- Self-attention: The input x serves as Query, Key, and Value. The output attn_output is regularized using dropout.
- First Add & Norm: We add the residual input x to the (dropout-modified) attention output. This sum is then passed through the first layer normalization (self.norm1). The result updates the variable x.
- Feed-forward: The result (x) is passed through the feed-forward network (self.feed_forward). Its output (ff_output) is regularized using dropout.
- Second Add & Norm: We add the residual (the output of self.norm1, stored as sublayer1_output in the code above) to the (dropout-modified) feed-forward output, then pass the sum through the second layer normalization (self.norm2).
Here's how you might create and use an EncoderBlock, including placeholder definitions for the sub-modules to make the example runnable:
# Example parameters
batch_size = 4
seq_len = 50
d_model = 512       # Model dimension
num_heads = 8       # Number of attention heads
d_ff = 2048         # Feed-forward inner dimension
dropout_prob = 0.1

# Create dummy input tensor (batch_size, seq_len, d_model)
dummy_input = torch.rand(batch_size, seq_len, d_model)

# --- Assume MultiHeadAttention & PositionwiseFeedForward are defined ---
# This part is just to make the example runnable stand-alone.
# Replace with your actual implementations from previous chapters/sections.
class MultiHeadAttention(nn.Module):
    # Placeholder implementation
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.head_dim = d_model // num_heads
        # Normally, these would be projections for Q, K, V
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)  # Final output projection

    def forward(self, query, key, value, mask=None):
        # Simplified placeholder: project the input and apply the output projection.
        # A real implementation performs scaled dot-product attention across
        # multiple heads and concatenates the results; here we only simulate the
        # final projection for shape consistency and ignore key, value, and mask.
        projected_q = self.fc_q(query)   # Example projection
        return self.fc_out(projected_q)  # Return shape (batch_size, seq_len, d_model)


class PositionwiseFeedForward(nn.Module):
    # Standard implementation
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)  # Dropout often applied here too
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear2(self.dropout(self.relu(self.linear1(x))))
# ------------------------------------------------------------------------

# Instantiate the Encoder Block using the (placeholder) modules
encoder_block = EncoderBlock(d_model, num_heads, d_ff, dropout_prob)

# Pass the input through the block and verify the shapes
output = encoder_block(dummy_input)

print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")

# Expected output:
# Input shape: torch.Size([4, 50, 512])
# Output shape: torch.Size([4, 50, 512])
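The example above passes no mask. For batches that contain padding, you would typically build a padding mask from the token ids and pass it to the block. The sketch below is illustrative: the pad id of 0 is an assumption, and whether True means "attend" or "mask out" depends on the MultiHeadAttention convention you adopted in Chapter 3.

# Hypothetical token ids, with id 0 reserved for padding (assumption for illustration)
token_ids = torch.randint(1, 1000, (batch_size, seq_len))
token_ids[:, -5:] = 0  # pretend the last 5 positions of every sequence are padding

# Shape (batch_size, 1, 1, seq_len), broadcastable to
# (batch_size, num_heads, seq_len, seq_len) inside the attention layer.
padding_mask = (token_ids != 0).unsqueeze(1).unsqueeze(2)

masked_output = encoder_block(dummy_input, mask=padding_mask)
print(f"Masked output shape: {masked_output.shape}")  # still (4, 50, 512)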
The output tensor retains the shape (batch_size, seq_len, d_model), which is essential. This allows the output of one EncoderBlock to be fed directly as input into the next EncoderBlock in the stack, enabling the construction of deep encoder models.
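As a minimal sketch of such a stack, reusing torch, nn, EncoderBlock, and dummy_input from the code above (the Encoder class name and num_layers parameter are illustrative, and a complete encoder would also add token embeddings and positional encodings before the first block):

class Encoder(nn.Module):
    """Stacks several EncoderBlock modules; illustrative sketch only."""
    def __init__(self, num_layers: int, d_model: int, num_heads: int,
                 d_ff: int, dropout_prob: float = 0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff, dropout_prob)
            for _ in range(num_layers)
        ])

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Each block maps (batch_size, seq_len, d_model) to the same shape,
        # so the output of one block feeds directly into the next.
        for layer in self.layers:
            x = layer(x, mask)
        return x

# The original Transformer used a stack of 6 encoder blocks:
encoder = Encoder(num_layers=6, d_model=512, num_heads=8, d_ff=2048)
encoded = encoder(dummy_input)
print(f"Encoder stack output shape: {encoded.shape}")  # torch.Size([4, 50, 512])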
Hyperparameters: The choice of d_model, num_heads, d_ff, and dropout_prob significantly affects the model's capacity, computational cost, and generalization ability. The original Transformer used d_model=512, num_heads=8, d_ff=2048, and dropout_prob=0.1. Larger models often use larger values; a quick parameter count for these settings is shown at the end of this section.
Normalization placement: This block uses Post-Layer Normalization. A Pre-LN variant applies layer normalization before each sub-layer, which changes the forward method structure. We will discuss the Pre-LN vs. Post-LN trade-offs in Chapter 6.
This hands-on example provides a concrete implementation of the encoder block, combining the theoretical components discussed earlier. By understanding how to build this fundamental unit, you are well-equipped to construct the entire encoder stack and appreciate the data transformations occurring within the Transformer architecture.
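Finally, as the promised parameter count: a quick way to see how the hyperparameter choices translate into capacity is to count the parameters of the block instantiated in the usage example above.

# Count the trainable parameters in a single encoder block.
num_params = sum(p.numel() for p in encoder_block.parameters() if p.requires_grad)
print(f"Parameters per encoder block: {num_params:,}")

# With d_model=512 and d_ff=2048, the four attention projections account for
# roughly 4 * d_model^2 weights and the feed-forward network for about
# 2 * d_model * d_ff, so each block holds a little over 3 million parameters.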