Having explored the mechanics of attention in the previous section, we now assemble these components into the core building blocks of the Transformer architecture: the Encoder and Decoder layers. Originally proposed in the paper "Attention Is All You Need" (Vaswani et al., 2017), Transformers have become foundational for numerous tasks in natural language processing and beyond. We will focus on implementing the structure of these blocks using TensorFlow and Keras, assuming familiarity with subclassing tf.keras.layers.Layer and with the concepts of self-attention.
The Transformer architecture typically consists of a stack of identical Encoder layers followed by a stack of identical Decoder layers. Let's construct each type of layer.
An Encoder layer processes the entire input sequence simultaneously. Its primary goal is to generate a rich representation of the input, encoding contextual information for each element in the sequence. Each Encoder layer has two main sub-layers:

1. A Multi-Head Self-Attention mechanism, which lets every position attend to every other position in the input sequence.
2. A Position-wise Feed-Forward Network (FFN), applied independently and identically to each position.
Residual connections are employed around each of the two sub-layers, followed by layer normalization. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
Structure of a single Transformer Encoder Layer. Dashed lines indicate residual connections.
The FFN consists of two linear transformations with an activation function in between: FFN(x) = activation(x W1 + b1) W2 + b2. A common choice for the activation is ReLU, although GELU is also frequently used. The dimensionality of the inner layer (dff) is typically larger than the model's dimensionality (d_model); the original paper uses dff = 2048 with d_model = 512.
Here's a simple implementation using Keras layers:
import tensorflow as tf


class PositionWiseFeedForward(tf.keras.layers.Layer):
    def __init__(self, d_model, dff, activation='relu', **kwargs):
        """
        Initializes the Position-wise Feed-Forward Network.

        Args:
            d_model: Dimensionality of the model (input/output).
            dff: Dimensionality of the inner feed-forward layer.
            activation: Activation function for the inner layer ('relu' or 'gelu').
            **kwargs: Additional keyword arguments for the base Layer class.
        """
        super().__init__(**kwargs)
        self.d_model = d_model
        self.dff = dff
        self.activation = activation

        # Use an explicit kernel initializer for reproducible weight initialization
        initializer = tf.keras.initializers.GlorotUniform()
        self.dense_1 = tf.keras.layers.Dense(dff, activation=activation,
                                             kernel_initializer=initializer,
                                             name='ffn_dense_1')
        self.dense_2 = tf.keras.layers.Dense(d_model,
                                             kernel_initializer=initializer,
                                             name='ffn_dense_2')

    def call(self, x):
        """
        Forward pass for the FFN.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model).

        Returns:
            Output tensor of shape (batch_size, seq_len, d_model).
        """
        x = self.dense_1(x)  # Shape: (batch_size, seq_len, dff)
        x = self.dense_2(x)  # Shape: (batch_size, seq_len, d_model)
        return x

    def get_config(self):
        config = super().get_config()
        config.update({
            'd_model': self.d_model,
            'dff': self.dff,
            'activation': self.activation  # Store the activation name passed at construction
        })
        return config
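To confirm the shape contract, a quick check like the following can be run; the dimensions here (d_model=128, dff=512) are illustrative only.

# Quick shape check for the FFN (illustrative dimensions)
ffn = PositionWiseFeedForward(d_model=128, dff=512)
dummy = tf.random.uniform((2, 10, 128))  # (batch_size, seq_len, d_model)
print(ffn(dummy).shape)  # Expected: (2, 10, 128)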
Now, let's combine the Multi-Head Attention (assuming we have a MultiHeadAttention layer defined as in the previous section) and the FFN into a single EncoderLayer. We also incorporate dropout for regularization, applied to the output of each sub-layer before the residual connection and normalization.
# Assume MultiHeadAttention layer is defined elsewhere or imported,
# e.g. from your_attention_module import MultiHeadAttention


class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1, **kwargs):
        """
        Initializes a single Transformer Encoder Layer.

        Args:
            d_model: Dimensionality of the model.
            num_heads: Number of attention heads.
            dff: Dimensionality of the inner feed-forward layer.
            rate: Dropout rate.
            **kwargs: Additional keyword arguments for the base Layer class.
        """
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.rate = rate

        self.mha = MultiHeadAttention(d_model, num_heads, name='multi_head_attention')
        self.ffn = PositionWiseFeedForward(d_model, dff, name='position_wise_ffn')

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6, name='layer_norm_1')
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6, name='layer_norm_2')

        self.dropout1 = tf.keras.layers.Dropout(rate, name='dropout_1')
        self.dropout2 = tf.keras.layers.Dropout(rate, name='dropout_2')

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer.

        Args:
            x: Input tensor of shape (batch_size, input_seq_len, d_model).
            training: Boolean, indicating whether the layer should behave in training mode (apply dropout).
            mask: Padding mask for self-attention, shape (batch_size, 1, 1, input_seq_len).

        Returns:
            Output tensor of shape (batch_size, input_seq_len, d_model),
            and optionally the attention weights if MultiHeadAttention returns them.
        """
        # Multi-Head Self-Attention block
        # MHA takes query, key, value, mask; for self-attention, query = key = value = x.
        attn_output, attn_weights = self.mha(x, x, x, mask, training=training)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        # Position-wise Feed-Forward block
        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        # Depending on your MultiHeadAttention implementation, you may also
        # return attn_weights for inspection.
        return out2  # Or: return out2, attn_weights

    def get_config(self):
        config = super().get_config()
        config.update({
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'dff': self.dff,
            'rate': self.rate
        })
        return config
Note the use of the training argument, which is important for controlling dropout behavior during training versus inference. The mask argument is used to prevent attention to padding tokens in the input sequence.
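As a quick sanity check, the sketch below runs an EncoderLayer on random inputs. It assumes the MultiHeadAttention layer from the previous section returns an (output, attention_weights) pair and treats a mask value of 1 as a blocked position, so an all-zeros mask masks nothing.

# Hypothetical smoke test for EncoderLayer (assumes MultiHeadAttention is defined)
batch_size, input_seq_len, d_model = 2, 10, 128
sample_input = tf.random.uniform((batch_size, input_seq_len, d_model))
padding_mask = tf.zeros((batch_size, 1, 1, input_seq_len))  # no positions masked

encoder_layer = EncoderLayer(d_model=d_model, num_heads=8, dff=512, rate=0.1)
output = encoder_layer(sample_input, False, padding_mask)  # (x, training, mask)
print(output.shape)  # Expected: (2, 10, 128)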
The Decoder layer shares similarities with the Encoder layer but introduces a third sub-layer to attend to the output of the Encoder stack. Its purpose is to generate the output sequence one element at a time, conditioned on the encoded input and the previously generated output elements.
Each Decoder layer has three main sub-layers:

1. A Masked Multi-Head Self-Attention mechanism over the target sequence, where a look-ahead mask prevents each position from attending to later positions.
2. A Multi-Head Cross-Attention (encoder-decoder attention) mechanism, where the queries come from the decoder and the keys and values come from the encoder output.
3. A Position-wise Feed-Forward Network, identical in structure to the one used in the Encoder.
Again, residual connections and layer normalization are applied after each sub-layer.
Structure of a single Transformer Decoder Layer. Dashed lines indicate residual connections. Note the two types of attention and the required masks.
Two types of masks are commonly used in the Decoder:

- Look-ahead mask: applied in the masked self-attention sub-layer to stop each position from attending to subsequent positions. For a target sequence of length T, it is often combined with the padding mask, resulting in a mask of shape (batch_size, 1, T, T).
- Padding mask: applied in the cross-attention sub-layer to prevent attention to padding tokens in the encoder's input, with shape (batch_size, 1, 1, input_seq_len). The self-attention layer might also need a padding mask if the target sequence itself has padding, which would be combined with the look-ahead mask.

The implementation combines these three sub-layers with residual connections, layer normalization, and dropout.
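Before turning to the layer itself, here is roughly how such masks can be built, assuming the convention from the previous section that a mask value of 1 marks a position to be blocked (added as a large negative number to the attention logits before the softmax). create_padding_mask and create_look_ahead_mask are illustrative helper names, not Keras built-ins, and padding token id 0 is an assumption.

def create_padding_mask(seq):
    # seq: (batch_size, seq_len) of token ids; assumes id 0 is the padding token
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Strictly upper-triangular ones: position i cannot attend to positions > i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)  # (size, size)

# Combined mask for decoder self-attention: block future positions and target padding
target_ids = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])  # toy batch, padding id 0
combined_mask = tf.maximum(create_look_ahead_mask(tf.shape(target_ids)[1]),
                           create_padding_mask(target_ids))  # (2, 1, 4, 4)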
# Assume MultiHeadAttention and PositionWiseFeedForward are defined


class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1, **kwargs):
        """
        Initializes a single Transformer Decoder Layer.

        Args:
            d_model: Dimensionality of the model.
            num_heads: Number of attention heads.
            dff: Dimensionality of the inner feed-forward layer.
            rate: Dropout rate.
            **kwargs: Additional keyword arguments for the base Layer class.
        """
        super().__init__(**kwargs)
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.rate = rate

        # Masked Self-Attention
        self.mha1 = MultiHeadAttention(d_model, num_heads, name='masked_multi_head_attention')
        # Cross-Attention (Encoder-Decoder)
        self.mha2 = MultiHeadAttention(d_model, num_heads, name='cross_multi_head_attention')

        self.ffn = PositionWiseFeedForward(d_model, dff, name='position_wise_ffn')

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6, name='layer_norm_1')
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6, name='layer_norm_2')
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6, name='layer_norm_3')

        self.dropout1 = tf.keras.layers.Dropout(rate, name='dropout_1')
        self.dropout2 = tf.keras.layers.Dropout(rate, name='dropout_2')
        self.dropout3 = tf.keras.layers.Dropout(rate, name='dropout_3')

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer.

        Args:
            x: Input tensor (target sequence embedding) of shape (batch_size, target_seq_len, d_model).
            enc_output: Output from the encoder stack, shape (batch_size, input_seq_len, d_model).
            training: Boolean, indicating whether the layer should behave in training mode.
            look_ahead_mask: Combined look-ahead and padding mask for self-attention,
                shape (batch_size, 1, target_seq_len, target_seq_len).
            padding_mask: Padding mask for the encoder output in cross-attention,
                shape (batch_size, 1, 1, input_seq_len).

        Returns:
            Output tensor of shape (batch_size, target_seq_len, d_model),
            and optionally the self-attention and cross-attention weights.
        """
        # Masked Multi-Head Self-Attention block
        # Q = K = V = x for self-attention
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, training=training)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(x + attn1)  # (batch_size, target_seq_len, d_model)

        # Multi-Head Cross-Attention block
        # Q = out1 (from decoder), K = V = enc_output (from encoder)
        attn2, attn_weights_block2 = self.mha2(out1, enc_output, enc_output, padding_mask, training=training)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)  # (batch_size, target_seq_len, d_model)

        # Position-wise Feed-Forward block
        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(out2 + ffn_output)  # (batch_size, target_seq_len, d_model)

        # Depending on your MultiHeadAttention implementation, you may also
        # return the attention weights for inspection.
        return out3  # Or: return out3, attn_weights_block1, attn_weights_block2

    def get_config(self):
        config = super().get_config()
        config.update({
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'dff': self.dff,
            'rate': self.rate
        })
        return config
In this DecoderLayer, note how mha1 (self-attention) uses x for query, key, and value and applies the look_ahead_mask, while mha2 (cross-attention) uses the output of the first sub-layer (out1) as the query and the enc_output as the key and value, applying the padding_mask corresponding to the encoder's input.
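A similar smoke test for the DecoderLayer, again a sketch that assumes the same MultiHeadAttention layer and the create_look_ahead_mask helper shown earlier:

# Hypothetical smoke test for DecoderLayer
batch_size, input_seq_len, target_seq_len, d_model = 2, 10, 7, 128
enc_output = tf.random.uniform((batch_size, input_seq_len, d_model))
target_emb = tf.random.uniform((batch_size, target_seq_len, d_model))

look_ahead_mask = create_look_ahead_mask(target_seq_len)        # broadcasts to (batch, 1, T, T)
enc_padding_mask = tf.zeros((batch_size, 1, 1, input_seq_len))  # no encoder padding in this toy batch

decoder_layer = DecoderLayer(d_model=d_model, num_heads=8, dff=512, rate=0.1)
output = decoder_layer(target_emb, enc_output, False,           # (x, enc_output, training, ...)
                       look_ahead_mask, enc_padding_mask)
print(output.shape)  # Expected: (2, 7, 128)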
A full Transformer model stacks multiple Encoder layers (N times) and multiple Decoder layers (N times). The output of the final Encoder layer becomes the enc_output fed into each Decoder layer's cross-attention mechanism. Positional encodings are typically added to the input embeddings before they enter the first Encoder or Decoder layer to provide information about the position of tokens in the sequence, since the self-attention mechanism itself does not inherently capture order.
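As a sketch of how the stacking works, an Encoder container might simply chain N EncoderLayer instances; token embedding and positional encoding are assumed to happen upstream and are omitted here.

class Encoder(tf.keras.layers.Layer):
    """Sketch of an Encoder stack that chains num_layers EncoderLayer blocks.
    Token embedding and positional encoding are assumed to be applied upstream."""
    def __init__(self, num_layers, d_model, num_heads, dff, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # x: (batch_size, input_seq_len, d_model), already embedded and position-encoded
        x = self.dropout(x, training=training)
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training, mask)
        return x  # (batch_size, input_seq_len, d_model)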
By implementing these EncoderLayer and DecoderLayer components using TensorFlow's Keras API, you gain modular and reusable blocks. These can be combined to construct the full Encoder and Decoder stacks, forming the backbone of a complete Transformer model suitable for various sequence transduction problems. The next steps typically involve adding input embeddings, positional encodings, and a final linear layer (plus softmax for classification/generation) to complete the model.