Now that we have examined the core components of the Transformer architecture, specifically Multi-Head Self-Attention and the Position-wise Feed-Forward Network, let's put them together in this practical exercise. We will implement a complete Transformer Encoder Layer using TensorFlow's Keras API. This layer represents a fundamental building block that can be stacked multiple times to form the full encoder component of a Transformer model.

Recall that a single encoder layer performs two main operations:

1. Calculates self-attention on the input sequence.
2. Applies a fully connected feed-forward network independently to each position.

Each of these operations is followed by a residual connection and layer normalization. Dropout is also applied within the layer for regularization.

## Defining the Encoder Layer Class

We'll create a custom Keras layer by subclassing `tf.keras.layers.Layer`. This gives us the flexibility to define the internal structure and the forward pass computation precisely.

```python
import tensorflow as tf

# Assume MultiHeadAttention and PositionWiseFeedForwardNetwork
# classes are defined as covered in previous sections.
# For completeness, here are simplified placeholder definitions:

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads
        # Simplified: In reality, contains Dense layers for Q, K, V, and output
        print(f"Placeholder MHA: d_model={d_model}, num_heads={num_heads}")

    def call(self, v, k, q, mask=None):
        # Simplified: Returns input query as placeholder for attention output
        # Actual implementation would compute scaled dot-product attention
        print("Placeholder MHA Call")
        # Output shape: (batch_size, seq_len_q, d_model)
        return q  # Placeholder


class PositionWiseFeedForwardNetwork(tf.keras.layers.Layer):
    def __init__(self, d_model, dff, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.dff = dff
        # Simplified: In reality, contains two Dense layers
        print(f"Placeholder FFN: d_model={d_model}, dff={dff}")

    def call(self, x):
        # Simplified: Returns input as placeholder
        print("Placeholder FFN Call")
        # Output shape: (batch_size, seq_len, d_model)
        return x  # Placeholder


# --- Actual Encoder Layer Implementation ---

class TransformerEncoderLayer(tf.keras.layers.Layer):
    """
    Implements a single Transformer Encoder layer with Multi-Head Attention,
    Feed Forward Network, Layer Normalization, and Dropout.
    """

    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1, **kwargs):
        """
        Initializes the Transformer Encoder Layer.

        Args:
            d_model: Dimensionality of the input and output (embedding dimension).
            num_heads: Number of attention heads.
            dff: Dimensionality of the inner-layer in the Feed Forward Network.
            dropout_rate: Float between 0 and 1. Fraction of the units to drop.
        """
        super().__init__(**kwargs)

        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.dropout_rate = dropout_rate

        # Multi-Head Attention sub-layer
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Position-wise Feed Forward Network sub-layer
        self.ffn = PositionWiseFeedForwardNetwork(d_model, dff)

        # Layer Normalization layers
        # Epsilon added for numerical stability
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        """
        Forward pass for the Transformer Encoder Layer.

        Args:
            x: Input tensor. Shape: (batch_size, input_seq_len, d_model)
            training: Boolean indicating if the layer should behave in
                training mode (apply dropout) or inference mode.
            mask: Optional mask for the attention mechanism.

        Returns:
            Output tensor. Shape: (batch_size, input_seq_len, d_model)
        """
        # 1. Multi-Head Attention (with residual connection and normalization)
        # Self-attention: query, key, and value are all the same input 'x'
        attn_output = self.mha(x, x, x, mask)  # Shape: (batch_size, input_seq_len, d_model)

        # Apply dropout to the attention output
        # Dropout is only applied during training
        attn_output = self.dropout1(attn_output, training=training)

        # Add residual connection and apply layer normalization
        # out1 = x + attn_output
        out1 = self.layernorm1(x + attn_output)  # Shape: (batch_size, input_seq_len, d_model)

        # 2. Feed Forward Network (with residual connection and normalization)
        ffn_output = self.ffn(out1)  # Shape: (batch_size, input_seq_len, d_model)

        # Apply dropout to the FFN output
        ffn_output = self.dropout2(ffn_output, training=training)

        # Add residual connection and apply layer normalization
        # out2 = out1 + ffn_output
        out2 = self.layernorm2(out1 + ffn_output)  # Shape: (batch_size, input_seq_len, d_model)

        return out2

    def get_config(self):
        """Serializes the layer configuration."""
        config = super().get_config()
        config.update({
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'dff': self.dff,
            'dropout_rate': self.dropout_rate
        })
        return config
```

In the `__init__` method, we instantiate the necessary sub-layers: `MultiHeadAttention`, `PositionWiseFeedForwardNetwork`, two `LayerNormalization` layers, and two `Dropout` layers. The hyperparameters `d_model`, `num_heads`, `dff`, and `dropout_rate` control the behavior and capacity of the layer.

The `call` method defines the computation flow:

1. The input `x` is passed through the `MultiHeadAttention` layer. Since this is self-attention within the encoder, the query, key, and value inputs to the attention mechanism are all the same tensor `x`. Any necessary padding mask is passed along.
2. Dropout is applied to the attention output.
Note the use of the `training` argument, which ensures dropout is only active during model training.
3. A residual connection is added (`x + attn_output`), followed by layer normalization (`self.layernorm1`).
4. The result (`out1`) is passed through the `PositionWiseFeedForwardNetwork`.
5. Dropout is applied again (`self.dropout2`).
6. Another residual connection is added (`out1 + ffn_output`), followed by the second layer normalization (`self.layernorm2`).
7. The final output `out2` has the same shape as the input `x`.

We also include a `get_config` method, which is good practice for custom Keras layers, allowing the layer to be easily saved and loaded.

## Visualizing the Encoder Layer Structure

The following diagram illustrates the data flow within the `TransformerEncoderLayer`.

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style="rounded,filled", fillcolor="#a5d8ff", fontname="helvetica"];
    edge [color="#495057"];

    subgraph cluster_0 {
        label = "Transformer Encoder Layer";
        bgcolor="#e9ecef";
        style="rounded";
        fontname="helvetica";

        input [label="Input (x)\n(batch, seq_len, d_model)", fillcolor="#ffec99"];
        mha [label="Multi-Head Attention"];
        dropout1 [label="Dropout", fillcolor="#ffc9c9"];
        add1 [label="+", shape=circle, fillcolor="#b2f2bb", width=0.3, height=0.3, fixedsize=true];
        norm1 [label="Layer Normalization"];
        ffn [label="Feed Forward Network"];
        dropout2 [label="Dropout", fillcolor="#ffc9c9"];
        add2 [label="+", shape=circle, fillcolor="#b2f2bb", width=0.3, height=0.3, fixedsize=true];
        norm2 [label="Layer Normalization"];
        output [label="Output\n(batch, seq_len, d_model)", fillcolor="#ffec99"];

        input -> mha [label=" Q, K, V"];
        mha -> dropout1;
        dropout1 -> add1 [label="attn_output"];
        input -> add1 [style=dashed];
        add1 -> norm1;
        norm1 -> ffn [label="out1"];
        ffn -> dropout2;
        dropout2 -> add2 [label="ffn_output"];
        norm1 -> add2 [style=dashed];
        add2 -> norm2;
        norm2 -> output [label="out2"];
    }
}
```

*Data flow within a single Transformer Encoder Layer, showing the Multi-Head Attention and Feed-Forward sub-layers, each followed by Dropout, a residual connection (Add), and Layer Normalization.*

## Instantiating and Testing the Layer

Let's create an instance of our `TransformerEncoderLayer` and pass some sample data through it to verify its operation and output shape.

```python
# Define hyperparameters
batch_size = 64
input_seq_len = 50
d_model = 512        # Embedding dimension
num_heads = 8        # Number of attention heads
dff = 2048           # Hidden layer size in FFN
dropout_rate = 0.1

# Create a sample input tensor (e.g., sequence embeddings)
# Replace with actual data in a real scenario
sample_input = tf.random.uniform((batch_size, input_seq_len, d_model))

# Instantiate the encoder layer
# Note: Using the actual MHA and FFN implementations is needed for real results
# The placeholder versions used above will just print messages and pass data through.
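# --- Optional illustration (added sketch, not part of the original exercise) ---
# The encoder layer's `mask` argument is intended for a padding mask. With a
# functional MHA implementation, such a mask could be built from token IDs,
# assuming here that ID 0 marks padding and that the MHA expects a broadcastable
# (batch_size, 1, 1, seq_len) mask; adjust to your MHA's convention.
sample_token_ids = tf.random.uniform((batch_size, input_seq_len), maxval=200, dtype=tf.int32)
padding_mask = tf.cast(tf.math.equal(sample_token_ids, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]
# `padding_mask` is not used below, since the placeholder MHA ignores its mask input.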
# Assuming you have functional MHA and FFN classes available:
# encoder_layer = TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)

# For demonstration with placeholders:
print("Instantiating Encoder Layer with Placeholders:")
encoder_layer = TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate, name="my_encoder_layer")
print("-" * 20)

# Pass the sample input through the layer
# Set training=False for inference mode (no dropout)
print("Running Encoder Layer Call (training=False):")
output_tensor = encoder_layer(sample_input, training=False)
print("-" * 20)

# Pass the sample input through the layer in training mode
print("Running Encoder Layer Call (training=True):")
output_tensor_train = encoder_layer(sample_input, training=True)
print("-" * 20)

# Check the output shape
print(f"Input shape: {sample_input.shape}")
print(f"Output shape (inference): {output_tensor.shape}")
print(f"Output shape (training): {output_tensor_train.shape}")

# Verify the output shape matches the input shape
assert sample_input.shape == output_tensor.shape
assert sample_input.shape == output_tensor_train.shape

print("\nEncoder layer created and tested successfully.")

# You can also inspect the layer's configuration
print("\nLayer Configuration:")
print(encoder_layer.get_config())
```

Running this code (assuming functional `MultiHeadAttention` and `PositionWiseFeedForwardNetwork` classes) will instantiate the encoder layer and process the sample input. The output shape should match the input shape `(batch_size, input_seq_len, d_model)`, confirming that the layer processes the sequence while maintaining the dimensionality required for stacking multiple layers. You'll also see the placeholder messages if using the simplified versions shown earlier.

This hands-on implementation provides a concrete understanding of how the different components integrate within a Transformer Encoder Layer. Typically, a full Transformer Encoder consists of multiple instances of this layer stacked sequentially, where the output of one layer becomes the input to the next.
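As a pointer toward that stacking step, here is a minimal sketch of wrapping several `TransformerEncoderLayer` instances in a single layer. The class name `TransformerEncoder` and the `num_layers` argument are illustrative choices (they are not defined elsewhere in this section), and token embedding plus positional encoding, which a full encoder applies before the first layer, are omitted for brevity.

```python
class TransformerEncoder(tf.keras.layers.Layer):
    """Illustrative stack of encoder layers; embedding and positional encoding omitted."""

    def __init__(self, num_layers, d_model, num_heads, dff, dropout_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # One TransformerEncoderLayer per stacked layer
        self.enc_layers = [
            TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]

    def call(self, x, training, mask=None):
        # The output of each layer becomes the input to the next;
        # the (batch_size, seq_len, d_model) shape is preserved throughout.
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training=training, mask=mask)
        return x

# Example run with the placeholder sub-layers defined above:
stacked_encoder = TransformerEncoder(num_layers=2, d_model=d_model,
                                     num_heads=num_heads, dff=dff)
encoded = stacked_encoder(sample_input, training=False)
print(f"Stacked encoder output shape: {encoded.shape}")  # (64, 50, 512)
```

The original Transformer stacks six such layers; the depth is simply another hyperparameter to tune for your task.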