Having explored the attention mechanisms and positional encodings that form the heart of the Transformer, we now assemble these components into the larger structures: the encoder and decoder stacks. The original Transformer model proposed a stack of N=6 identical layers for both the encoder and the decoder, although modern large language models often employ significantly more layers.
The encoder's role is to process the input sequence and generate a sequence of context-rich representations. Each layer in the encoder stack receives a sequence of embeddings (or the output of the previous layer) and transforms it. A single encoder layer consists of two main sub-layers:

1. A multi-head self-attention mechanism, in which every position attends to all positions of the layer's input.
2. A position-wise feed-forward network (FFN), applied identically and independently to each position.
Crucially, residual connections are employed around each of the two sub-layers, followed by layer normalization. If x is the input to a sub-layer (e.g., Multi-Head Attention or FFN) and Sublayer(x) is the function implemented by the sub-layer itself, the output of the sub-layer block is LayerNorm(x+Sublayer(x)). This structure aids in training deeper models by facilitating gradient flow and stabilizing activations.
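To make the pattern concrete, here is a minimal sketch of this post-norm residual wrapper. The SublayerConnection name is a hypothetical helper introduced only for illustration; the full layer implementations below inline the same logic rather than using such a wrapper.

import torch.nn as nn

class SublayerConnection(nn.Module):
    """Post-LN residual block: LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # 'sublayer' is any callable mapping (batch, seq, d_model) -> same shape,
        # e.g. a multi-head attention or feed-forward module
        return self.norm(x + self.dropout(sublayer(x)))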
Flow within a single Transformer Encoder layer.
The output of one encoder layer serves as the input to the next identical encoder layer. The stacking allows the model to progressively build more complex representations of the input sequence, capturing dependencies at various levels.
Here's a simplified PyTorch representation of an encoder layer's structure:
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model,
            num_heads,
            dropout=dropout,
            batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # Or GELU, SwiGLU, etc.
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask=None):
        # --- Self-Attention Block ---
        # Q, K, V are all from the input 'src'
        attn_output, _ = self.self_attn(
            src,
            src,
            src,
            key_padding_mask=src_mask,
            need_weights=False
        )
        # Residual connection and Layer Normalization
        src = self.norm1(src + self.dropout(attn_output))

        # --- Feed-Forward Block ---
        ff_output = self.feed_forward(src)
        # Residual connection and Layer Normalization
        src = self.norm2(src + self.dropout(ff_output))
        return src


class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Assuming input embeddings (incl. positional) are handled outside

    def forward(self, src, src_mask=None):
        for layer in self.layers:
            src = layer(src, src_mask)
        return src  # Final output representation
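As a quick sanity check, the stack can be exercised with random inputs (reusing the imports and classes above). The hyperparameters below match the base configuration from the original Transformer (N=6, d_model=512, 8 heads, d_ff=2048); the batch and sequence sizes are arbitrary.

encoder = Encoder(num_layers=6, d_model=512, num_heads=8, d_ff=2048)
src = torch.randn(2, 10, 512)  # (batch, seq_len, d_model), already embedded
# Boolean padding mask: True marks positions the attention should ignore
src_mask = torch.zeros(2, 10, dtype=torch.bool)
output = encoder(src, src_mask)
print(output.shape)  # torch.Size([2, 10, 512])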
The final output of the entire encoder stack is a sequence of vectors, one per input token, each capturing the contextual meaning of that token within the sequence. This output is what the decoder attends to through its encoder-decoder attention sub-layers.
The decoder's role is to generate an output sequence, one token at a time, conditioned on the encoded input sequence and the tokens generated so far. Like the encoder, the decoder is composed of a stack of N identical layers. Each decoder layer, however, has three main sub-layers:

1. A masked multi-head self-attention mechanism over the decoder's own inputs, where the mask prevents each position from attending to future positions.
2. A multi-head encoder-decoder (cross-) attention mechanism, where queries come from the decoder and keys and values come from the encoder's output.
3. A position-wise feed-forward network, identical in structure to the one in the encoder.
Similar to the encoder, residual connections and layer normalization are applied after each of these three sub-layers: LayerNorm(x+Sublayer(x)).
Flow within a single Transformer Decoder layer, highlighting the three sub-layers and inputs.
A simplified PyTorch structure for a decoder layer illustrates this:
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model,
            num_heads,
            dropout=dropout,
            batch_first=True
        )
        self.encoder_attn = nn.MultiheadAttention(
            d_model,
            num_heads,
            dropout=dropout,
            batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # Or GELU, SwiGLU, etc.
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        # --- Masked Self-Attention Block ---
        # Q, K, V are all from the target input 'tgt'
        # 'tgt_mask' prevents attending to future positions
        attn_output, _ = self.self_attn(
            tgt,
            tgt,
            tgt,
            attn_mask=tgt_mask,
            need_weights=False
        )
        # Residual connection and Layer Normalization
        tgt = self.norm1(tgt + self.dropout(attn_output))

        # --- Encoder-Decoder Attention Block ---
        # Q from the previous decoder sub-layer ('tgt'),
        # K and V from the encoder output ('memory')
        # 'memory_mask' (optional) masks padding in the source sequence
        attn_output, _ = self.encoder_attn(
            tgt,
            memory,
            memory,
            key_padding_mask=memory_mask,
            need_weights=False
        )
        # Residual connection and Layer Normalization
        tgt = self.norm2(tgt + self.dropout(attn_output))

        # --- Feed-Forward Block ---
        ff_output = self.feed_forward(tgt)
        # Residual connection and Layer Normalization
        tgt = self.norm3(tgt + self.dropout(ff_output))
        return tgt


class Decoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Assuming target embeddings (incl. positional) are handled outside

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        for layer in self.layers:
            tgt = layer(tgt, memory, tgt_mask, memory_mask)
        return tgt  # Final output representation before linear/softmax
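A usage sketch for the decoder follows the same pattern. The causal mask is built here with torch.triu: a boolean attn_mask with True above the diagonal blocks attention to future positions. The dimensions are illustrative.

decoder = Decoder(num_layers=6, d_model=512, num_heads=8, d_ff=2048)
memory = torch.randn(2, 10, 512)  # output of the encoder stack
tgt = torch.randn(2, 7, 512)      # embedded target tokens generated so far
# Causal mask: True entries are positions a query may NOT attend to
tgt_mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)
output = decoder(tgt, memory, tgt_mask=tgt_mask)
print(output.shape)  # torch.Size([2, 7, 512])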
After the input passes through the entire decoder stack, the resulting sequence of vectors represents the predicted output tokens. To convert these vectors into probabilities over the target vocabulary, a final linear layer is typically applied, followed by a softmax function. The linear layer projects the decoder output vector (of dimension d_model) to the size of the vocabulary V. The softmax function then converts these scores (logits) into probabilities, indicating the likelihood of each word in the vocabulary being the next token in the sequence.
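A minimal sketch of this projection step might look as follows; the class name and vocabulary size are illustrative, and in practice the softmax is usually folded into the loss (e.g. nn.CrossEntropyLoss operates on the raw logits).

class OutputProjection(nn.Module):
    """Maps decoder outputs of dimension d_model to a distribution over the vocabulary."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        logits = self.proj(x)                 # (batch, seq_len, vocab_size)
        return torch.softmax(logits, dim=-1)  # probabilities for each next token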
In equation form:

P(y_i | y_<i, x) = softmax(Linear(DecoderOutput_i))

The combination of the encoder stack (processing the input) and the decoder stack (generating the output conditioned on the input and previous outputs) forms the complete Transformer architecture, capable of handling a wide range of sequence-to-sequence tasks. Understanding how these stacks are constructed from attention and feed-forward layers, along with normalization and residuals, is fundamental to building and modifying these powerful models.