Alright, let's bring together the components we've meticulously examined. In previous sections and chapters, we explored the building blocks: Input Embeddings, Positional Encoding, Multi-Head Attention (including the self-attention variant and scaled dot-product attention), Add & Norm layers, Position-wise Feed-Forward networks, and the structure of individual Encoder and Decoder layers. We also discussed how to prepare data, create batches with masks, choose loss functions, and select optimization strategies.
Now it's time to assemble these parts into a complete Transformer model structure. We'll define a main Transformer class that orchestrates the data flow through the encoder and decoder stacks. This exercise solidifies your understanding of how these individual modules interact within the larger architecture.
We'll assume you have access to implementations of the following components (perhaps from earlier practical exercises or provided helper modules):
- InputEmbeddings: Converts input token IDs to dense vectors.
- PositionalEncoding: Adds positional information to embeddings.
- EncoderLayer: Contains Multi-Head Self-Attention and Feed-Forward sub-layers.
- DecoderLayer: Contains Masked Multi-Head Self-Attention, Encoder-Decoder Attention, and Feed-Forward sub-layers.
- MultiHeadAttention: The core attention mechanism.
- PositionwiseFeedForward: The fully connected feed-forward network.
- OutputProjection: A final linear layer to map decoder outputs to vocabulary probabilities.
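For reference, here are minimal sketches of the two simplest of these interfaces, written the way the assembly code below assumes they behave. They are illustrative only; your implementations from earlier sections may differ in detail, and EncoderLayer, DecoderLayer, MultiHeadAttention, PositionwiseFeedForward, and PositionalEncoding are assumed to follow the interfaces developed in the previous chapters.

import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Token embedding lookup, scaled by sqrt(d_model) as in the original paper."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embedding(token_ids) * math.sqrt(self.d_model)

class OutputProjection(nn.Module):
    """Final linear layer mapping decoder states to vocabulary logits."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size)
        return self.proj(x)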
Let's structure the main Transformer class. We'll use a PyTorch-like pseudocode style for clarity, focusing on the architecture rather than specific framework intricacies.
The main class will initialize all the necessary layers and define the forward pass that takes source and target sequences (along with their masks) and produces the final output logits.
import torch
import torch.nn as nn
import copy

# Assume EncoderLayer, DecoderLayer, InputEmbeddings, PositionalEncoding,
# OutputProjection are defined elsewhere based on previous sections/chapters.

class Transformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 d_model: int,          # Embedding dimension
                 n_head: int,           # Number of attention heads
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 d_ff: int,             # Dimension of feed-forward layer
                 dropout: float = 0.1,
                 max_seq_len: int = 512):
        super().__init__()
        self.d_model = d_model

        # Embeddings and Positional Encoding
        self.src_embedding = InputEmbeddings(d_model, src_vocab_size)
        self.tgt_embedding = InputEmbeddings(d_model, tgt_vocab_size)
        self.positional_encoding = PositionalEncoding(d_model, dropout, max_len=max_seq_len)

        # --- Encoder Stack ---
        # Create one EncoderLayer instance, then deep-copy it to get N
        # independent layers (no parameter sharing between layers)
        encoder_layer = EncoderLayer(d_model, n_head, d_ff, dropout)
        self.encoder_stack = nn.ModuleList(
            [copy.deepcopy(encoder_layer) for _ in range(num_encoder_layers)])

        # --- Decoder Stack ---
        # Same pattern: one DecoderLayer instance, deep-copied N times
        decoder_layer = DecoderLayer(d_model, n_head, d_ff, dropout)
        self.decoder_stack = nn.ModuleList(
            [copy.deepcopy(decoder_layer) for _ in range(num_decoder_layers)])

        # Final Output Layer
        self.output_projection = OutputProjection(d_model, tgt_vocab_size)

        # Initialize parameters (important for stable training)
        self._initialize_parameters()

    def _initialize_parameters(self):
        # Use Xavier uniform initialization for all weight matrices (dim > 1)
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def encode(self, src: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
        """Processes the source sequence through the encoder stack."""
        # Apply embedding and positional encoding
        src_emb = self.positional_encoding(self.src_embedding(src))
        # Pass through each encoder layer
        encoder_output = src_emb
        for layer in self.encoder_stack:
            encoder_output = layer(encoder_output, src_mask)
        return encoder_output

    def decode(self, tgt: torch.Tensor, encoder_output: torch.Tensor,
               tgt_mask: torch.Tensor, src_tgt_mask: torch.Tensor) -> torch.Tensor:
        """Processes the target sequence and encoder output through the decoder stack."""
        # Apply embedding and positional encoding
        tgt_emb = self.positional_encoding(self.tgt_embedding(tgt))
        # Pass through each decoder layer
        decoder_output = tgt_emb
        for layer in self.decoder_stack:
            decoder_output = layer(decoder_output, encoder_output, tgt_mask, src_tgt_mask)
        return decoder_output

    def forward(self,
                src: torch.Tensor,
                tgt: torch.Tensor,
                src_mask: torch.Tensor,      # Masks padding in source
                tgt_mask: torch.Tensor,      # Masks future tokens & padding in target
                src_tgt_mask: torch.Tensor   # Masks source padding for encoder-decoder attention
                ) -> torch.Tensor:
        """
        The main forward pass of the Transformer model.

        src: (batch_size, src_seq_len)
        tgt: (batch_size, tgt_seq_len)
        src_mask:     (batch_size, 1, 1, src_seq_len)            # Self-attention in the encoder
        tgt_mask:     (batch_size, 1, tgt_seq_len, tgt_seq_len)  # Masked self-attention in the decoder
        src_tgt_mask: (batch_size, 1, 1, src_seq_len)            # Encoder-decoder attention in the decoder
        """
        # 1. Pass source sequence through the encoder
        encoder_output = self.encode(src, src_mask)  # (batch_size, src_seq_len, d_model)
        # 2. Pass target sequence and encoder output through the decoder
        decoder_output = self.decode(tgt, encoder_output, tgt_mask, src_tgt_mask)  # (batch_size, tgt_seq_len, d_model)
        # 3. Project decoder output to vocabulary space
        logits = self.output_projection(decoder_output)  # (batch_size, tgt_seq_len, tgt_vocab_size)
        return logits
A few points about this assembly are worth highlighting:

- The Transformer class acts as a container. It doesn't implement the attention or feed-forward logic itself but delegates these tasks to the EncoderLayer and DecoderLayer modules. This promotes code reuse and clarity.
- Note the use of copy.deepcopy when creating the encoder and decoder stacks. While the architecture of each layer within a stack is identical, the parameters (weights and biases) are typically not shared between layers. Each layer learns its own transformations.
- The stacks are held in an nn.ModuleList (or an equivalent structure in other frameworks). This ensures that all layers are properly registered as sub-modules and that their parameters are included when training the model.
- The forward logic is split into separate encode and decode methods. This makes the main forward method cleaner and easier to follow. It also allows using just the encoder or decoder part if needed for specific applications (like using encoder outputs for sentence embeddings).
- The forward method explicitly requires the different masks we discussed earlier (src_mask, tgt_mask, src_tgt_mask). These are essential for handling padding and preventing the decoder from attending to future tokens. Generating these masks correctly during data preparation is a significant step; a sketch of one way to build them appears after the diagram below.

The following diagram illustrates the high-level flow within the assembled Transformer class during the forward pass:
High-level data flow within the assembled Transformer model, showing inputs, embedding, encoder/decoder stacks, and the final output projection. Masks are provided at relevant stages.
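As a concrete example, here is one way such masks might be constructed. This sketch assumes the MultiHeadAttention from earlier sections expects boolean masks where True marks positions that may be attended to, and that the padding token id is 0; if your attention implementation uses additive masks instead, the shapes stay the same but the values differ.

def make_pad_mask(seq: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # (batch, seq_len) -> (batch, 1, 1, seq_len); True where attention is allowed
    return (seq != pad_id).unsqueeze(1).unsqueeze(2)

def make_tgt_mask(tgt: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # Combine the target padding mask with a lower-triangular causal mask
    # so each position can only attend to itself and earlier positions.
    tgt_len = tgt.size(1)
    pad_mask = make_pad_mask(tgt, pad_id)                      # (batch, 1, 1, tgt_len)
    causal = torch.tril(torch.ones(tgt_len, tgt_len,
                                   dtype=torch.bool, device=tgt.device))
    return pad_mask & causal                                   # (batch, 1, tgt_len, tgt_len)

The src_tgt_mask passed to encoder-decoder attention is simply the source padding mask reused, since the decoder's queries attend over encoder positions.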
With the class defined, you can create an instance by specifying the hyperparameters:
# Example Hyperparameters
num_layers = 6
d_model = 512
n_head = 8
d_ff = 2048 # Typically 4 * d_model
src_vocab = 10000 # Example source vocabulary size
tgt_vocab = 12000 # Example target vocabulary size
dropout_rate = 0.1
max_len = 500
# Instantiate the model
transformer_model = Transformer(num_encoder_layers=num_layers,
                                num_decoder_layers=num_layers,
                                d_model=d_model,
                                n_head=n_head,
                                src_vocab_size=src_vocab,
                                tgt_vocab_size=tgt_vocab,
                                d_ff=d_ff,
                                dropout=dropout_rate,
                                max_seq_len=max_len)
print(f"Transformer model instantiated with {num_layers} layers, d_model={d_model}.")
# You could potentially add a check here with dummy inputs:
# dummy_src = torch.randint(0, src_vocab, (2, 10)) # Batch size 2, seq len 10
# dummy_tgt = torch.randint(0, tgt_vocab, (2, 12)) # Batch size 2, seq len 12
# ... create dummy masks ...
# output = transformer_model(dummy_src, dummy_tgt, dummy_src_mask, dummy_tgt_mask, dummy_src_tgt_mask)
# print(f"Output shape: {output.shape}") # Should be (2, 12, tgt_vocab)
This assembled Transformer class provides the complete structure. The next logical step involves setting up the training loop: feeding batches of data (source sequences, target sequences, and the corresponding masks), calculating the loss (e.g., cross-entropy) between the model's output logits and the actual target sequences, and using an optimizer (like Adam with specific learning rate scheduling) to update the model's parameters via backpropagation.
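As a preview, a single training step might look roughly like the sketch below. It assumes the mask helpers from earlier, a padding id of 0 (pad_id as above), and standard teacher forcing; details such as the warmup learning-rate schedule, label smoothing, and gradient clipping are deliberately left out here.

criterion = nn.CrossEntropyLoss(ignore_index=pad_id)   # don't penalize padded target positions
optimizer = torch.optim.Adam(transformer_model.parameters(),
                             lr=1e-4, betas=(0.9, 0.98), eps=1e-9)

def train_step(src_batch: torch.Tensor, tgt_batch: torch.Tensor) -> float:
    # Teacher forcing: the decoder sees tokens up to position t and predicts token t+1
    tgt_in, tgt_out = tgt_batch[:, :-1], tgt_batch[:, 1:]
    src_mask = make_pad_mask(src_batch, pad_id)
    tgt_mask = make_tgt_mask(tgt_in, pad_id)

    logits = transformer_model(src_batch, tgt_in, src_mask, tgt_mask, src_mask)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()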