The Transformer encoder utilizes several individual components, including multi-head self-attention for capturing contextual relationships, position-wise feed-forward networks for non-linear transformation, residual connections for gradient flow, and layer normalization for stabilizing activations. These parts are assembled into a functional EncoderBlock. Constructing an EncoderBlock demonstrates how these sub-layers interact within a single encoder layer, forming the repeatable unit stacked to create the full encoder.Our goal is to implement a standard Transformer encoder block module, typically using a framework like PyTorch or TensorFlow. For this example, we'll use PyTorch syntax.Encoder Block Structure RecapRecall the data flow within a single encoder block:The input sequence embeddings (combined with positional encodings) first pass through a multi-head self-attention mechanism.A residual connection adds the original input to the output of the attention sub-layer.The result is then processed by layer normalization.This normalized output goes into a position-wise feed-forward network.Another residual connection adds the input to the feed-forward sub-layer to its output.Finally, a second layer normalization is applied.Dropout is typically applied after the self-attention and feed-forward sub-layers, before the residual connection and normalization.digraph G { rankdir=TB; node [shape=box, style="filled", fillcolor="#e9ecef", fontname="sans-serif"]; edge [fontname="sans-serif"]; inp [label="Input (x)"]; mha [label="Multi-Head\nSelf-Attention", fillcolor="#a5d8ff"]; drop1 [label="Dropout", fillcolor="#ffec99"]; add1 [label="Add", shape=circle, fillcolor="#b2f2bb", width=0.5]; norm1 [label="LayerNorm", fillcolor="#bac8ff"]; ffn [label="Position-wise\nFeed-Forward", fillcolor="#a5d8ff"]; drop2 [label="Dropout", fillcolor="#ffec99"]; add2 [label="Add", shape=circle, fillcolor="#b2f2bb", width=0.5]; norm2 [label="LayerNorm", fillcolor="#bac8ff"]; out [label="Output"]; inp -> mha; mha -> drop1; drop1 -> add1; inp -> add1 [style=dashed, color="#868e96"]; add1 -> norm1; norm1 -> ffn; ffn -> drop2; drop2 -> add2; norm1 -> add2 [style=dashed, color="#868e96"]; add2 -> norm2; norm2 -> out; }Data flow within a standard Transformer Encoder Block (Post-LN variant). Dashed lines indicate residual connections.Implementation in PyTorchLet's define a PyTorch nn.Module for the EncoderBlock. We assume you have implementations for MultiHeadAttention (as developed in Chapter 3) and PositionwiseFeedForward (discussed earlier in this chapter) available.import torch import torch.nn as nn # Assume MultiHeadAttention and PositionwiseFeedForward classes are defined elsewhere # class MultiHeadAttention(nn.Module): ... # class PositionwiseFeedForward(nn.Module): ... class EncoderBlock(nn.Module): """ Implements a single Transformer Encoder Block. This block follows the structure described in "Attention Is All You Need": Input -> Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm -> Output Uses Post-Layer Normalization (Add first, then Norm). """ def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout_prob: float = 0.1): """ Args: d_model: The dimensionality of the input and output embeddings (model dimension). num_heads: The number of attention heads. d_ff: The inner dimension of the feed-forward network. dropout_prob: The dropout probability applied after attention and FFN. """ super().__init__() if d_model % num_heads != 0: raise ValueError(f"'d_model' ({d_model}) must be divisible by 'num_heads' ({num_heads})") self.self_attn = MultiHeadAttention(d_model, num_heads) self.feed_forward = PositionwiseFeedForward(d_model, d_ff) self.norm1 = nn.LayerNorm(d_model) self.norm2 = nn.LayerNorm(d_model) self.dropout = nn.Dropout(dropout_prob) def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor: """ Passes the input through the encoder block. Args: x: Input tensor of shape (batch_size, seq_len, d_model). mask: Optional mask for the self-attention layer. Typically used for padding. Shape should be broadcastable to (batch_size, num_heads, seq_len, seq_len). Returns: Output tensor of shape (batch_size, seq_len, d_model). """ # 1. Multi-Head Self-Attention + Residual + LayerNorm # Calculate attention output. Q=K=V=x for self-attention. attn_output = self.self_attn(x, x, x, mask) # Apply dropout to the attention output, then add residual connection (input x), # and finally apply layer normalization. x = self.norm1(x + self.dropout(attn_output)) # Store the output of the first sub-layer (attention + add&norm) # This will be the input to the second residual connection. sublayer1_output = x # 2. Feed-Forward Network + Residual + LayerNorm # Calculate feed-forward output ff_output = self.feed_forward(sublayer1_output) # Apply dropout to the FFN output, then add residual connection # (using the output from the first sub-layer), and apply layer normalization. x = self.norm2(sublayer1_output + self.dropout(ff_output)) return xUnderstanding the CodeInitialization (__init__): We instantiate the necessary sub-modules: MultiHeadAttention, PositionwiseFeedForward, two LayerNorm layers, and a Dropout layer. The LayerNorm layers normalize over the feature dimension (d_model). A check ensures d_model is divisible by num_heads for the multi-head attention mechanism.Forward Pass (forward):The first part handles the multi-head self-attention sub-layer. The input x serves as Query, Key, and Value. The output attn_output is regularized using dropout.The first residual connection adds the original input x to the (dropout-modified) attention output. This sum is then passed through the first layer normalization (self.norm1). The result updates the variable x.The second part handles the position-wise feed-forward sub-layer. The output from the first normalization (x) is passed through the feed-forward network (self.feed_forward).Its output (ff_output) is regularized using dropout.The second residual connection adds the input to the feed-forward layer (which is the output of self.norm1, stored temporarily as sublayer1_output in the refined code) to the (dropout-modified) feed-forward output.This sum is passed through the second layer normalization (self.norm2).The final tensor, having passed through both sub-layers with residuals and normalization, is returned.Instantiation and Usage ExampleHere's how you might create and use an EncoderBlock, including placeholder definitions for the sub-modules to make the example runnable:# Example parameters batch_size = 4 seq_len = 50 d_model = 512 # Model dimension num_heads = 8 # Number of attention heads d_ff = 2048 # Feed-forward inner dimension dropout_prob = 0.1 # Create dummy input tensor (batch_size, seq_len, d_model) dummy_input = torch.rand(batch_size, seq_len, d_model) # --- Assume MultiHeadAttention & PositionwiseFeedForward are defined --- # This part is just to make the example runnable stand-alone. # Replace with your actual implementations from previous chapters/sections. class MultiHeadAttention(nn.Module): # Placeholder implementation def __init__(self, d_model, num_heads): super().__init__() self.d_model = d_model self.num_heads = num_heads if d_model % num_heads != 0: raise ValueError("d_model must be divisible by num_heads") self.head_dim = d_model // num_heads # Normally, these would be projections for Q, K, V self.fc_q = nn.Linear(d_model, d_model) self.fc_k = nn.Linear(d_model, d_model) self.fc_v = nn.Linear(d_model, d_model) self.fc_out = nn.Linear(d_model, d_model) # Final output projection def forward(self, query, key, value, mask=None): # Simplified placeholder: project input, apply linear output layer # In a real implementation, this performs scaled dot-product attention # across multiple heads and concatenates results. batch_size = query.shape[0] # Just simulate the final output projection for shape consistency # This ignores actual attention calculation for simplicity here projected_q = self.fc_q(query) # Example projection return self.fc_out(projected_q) # Return shape (batch_size, seq_len, d_model) class PositionwiseFeedForward(nn.Module): # Standard implementation def __init__(self, d_model, d_ff, dropout=0.1): super().__init__() self.linear1 = nn.Linear(d_model, d_ff) self.relu = nn.ReLU() self.dropout = nn.Dropout(dropout) # Dropout often applied here too self.linear2 = nn.Linear(d_ff, d_model) def forward(self, x): # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model) return self.linear2(self.dropout(self.relu(self.linear1(x)))) # ------------------------------------------------------------------------ # Instantiate the Encoder Block using the (placeholder) modules encoder_block = EncoderBlock(d_model, num_heads, d_ff, dropout_prob) # Pass the input through the block output = encoder_block(dummy_input) print(f"Input shape: {dummy_input.shape}") print(f"Output shape: {output.shape}") # Verify shapes # Expected output: # Input shape: torch.Size([4, 50, 512]) # Output shape: torch.Size([4, 50, 512])The output tensor retains the shape (batch_size, seq_len, d_model), which is essential. This allows the output of one EncoderBlock to be fed directly as input into the next EncoderBlock in the stack, enabling the construction of deep encoder models.Configuration and VariationsHyperparameters: The choice of d_model, num_heads, d_ff, and dropout_prob significantly affects the model's capacity, computational cost, and generalization ability. The original Transformer used d_model=512, num_heads=8, d_ff=2048, and dropout_prob=0.1. Larger models often use larger values.Normalization Placement (Pre-LN): This implementation uses Post-LN (Layer Normalization after the residual addition). An alternative, Pre-LN, applies Layer Normalization before the self-attention and feed-forward sub-layers, with the residual connection added afterward. Pre-LN often leads to more stable training, especially for deeper models, and requires modifying the forward method structure. We will discuss the Pre-LN vs. Post-LN trade-offs in Chapter 6.This hands-on example provides a concrete implementation of the encoder block, combining the theoretical components discussed earlier. By understanding how to build this fundamental unit, you are well-equipped to construct the entire encoder stack and appreciate the data transformations occurring within the Transformer architecture.