When scaling a Transformer model, two primary architectural dimensions you can adjust are its depth (the number of layers) and its width (the size of the hidden representations and feed-forward networks). As hinted by the scaling laws mentioned earlier, simply increasing the total parameter count isn't the whole story. How those parameters are allocated between depth and width significantly impacts model behavior, training dynamics, and computational efficiency. There isn't a universally optimal ratio; the best choice often depends on the specific task, dataset, available compute, and desired inference characteristics.
Let's examine the factors involved in this trade-off.
Adding more layers (increasing N, the number of encoder or decoder blocks) allows the model to potentially learn more complex, hierarchical representations of the input data. Each layer builds upon the transformations performed by the previous ones, enabling a deeper processing pipeline.
Potential Benefits:
- Greater capacity for learning complex, hierarchical representations, since each layer refines the transformations produced by the one before it.
- Depth adds parameters and processing stages without enlarging any single layer, keeping per-layer memory requirements modest.

Potential Drawbacks:
- A longer sequential computation path: each additional layer must wait for the previous one, increasing forward- and backward-pass latency (the short timing sketch below illustrates this).
- Optimization hurdles: very deep stacks can be harder to train stably and typically rely on residual connections, normalization, and careful learning-rate schedules.
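To see the sequential-path effect directly, here is a minimal sketch; the layer sizes are arbitrary and the printed timings will vary by hardware, but since each layer must wait for the output of the previous one, per-pass latency should grow roughly in proportion to depth.

import time
import torch
import torch.nn as nn

def forward_time(num_layers, d_model=256, seq_len=128, batch=8, repeats=3):
    """Roughly time one forward pass through an encoder stack of the given depth."""
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    encoder.eval()  # disable dropout for a steadier measurement
    x = torch.randn(batch, seq_len, d_model)
    with torch.no_grad():
        encoder(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(repeats):
            encoder(x)
    return (time.perf_counter() - start) / repeats

for depth in (2, 4, 8):
    print(f"{depth} layers: {forward_time(depth) * 1000:.1f} ms per forward pass")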
Increasing width typically involves enlarging the core embedding dimension (d_model) and often proportionally increasing the intermediate dimension in the position-wise feed-forward networks (d_ff, frequently set to 4 × d_model).
Potential Benefits:
- More representational capacity per layer: larger embeddings and feed-forward projections can mix and store more information at each step.
- Wider matrix multiplications tend to parallelize well on modern accelerators.

Potential Drawbacks:
- Memory and compute costs per layer grow quickly, because the main weight matrices scale quadratically with d_model (the rough estimate below makes this concrete).
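To make that quadratic relationship concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard layer layout (four d_model × d_model attention projections plus a two-matrix feed-forward block with d_ff = 4 × d_model) and ignores biases, layer norms, and embeddings, so treat its output as a rough estimate rather than an exact count.

def approx_params_per_layer(d_model, d_ff=None):
    """Rough per-layer parameter estimate for a standard Transformer block."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model     # Q, K, V, and output projection matrices
    feed_forward = 2 * d_model * d_ff     # two linear maps: d_model -> d_ff -> d_model
    return attention + feed_forward

for d in (512, 1024, 2048):
    print(f"d_model={d}: ~{approx_params_per_layer(d) / 1e6:.1f}M parameters per layer")
# Doubling d_model roughly quadruples the per-layer count (~3.1M -> ~12.6M -> ~50.3M).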
Empirical research on scaling laws, such as the work by Kaplan et al. (2020), suggests that for optimal performance under a fixed compute budget, model size, dataset size, and training compute should be scaled in tandem according to power laws. When scaling model size, studies indicate that increasing both depth and width yields better results than scaling only one dimension dramatically.
For instance, architectures like GPT-3 use a large number of layers (e.g., 96) and a large hidden dimension size (e.g., 12288). The exact ratio often evolves as models grow; comparing GPT-2 to GPT-3 shows increases in both depth and width, but not necessarily proportionally.
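Two quick back-of-the-envelope checks help connect these ideas; both use rough approximations (weight matrices only, no embeddings or biases, and an illustrative token count), so the numbers are order-of-magnitude estimates rather than exact figures.

# Rough parameter estimate for the GPT-3 configuration quoted above
# (about 12 * d_model^2 weights per layer: attention plus feed-forward).
num_layers, d_model = 96, 12288
approx_params = num_layers * 12 * d_model ** 2
print(f"~{approx_params / 1e9:.0f}B parameters")  # ~174B, near the widely reported ~175B

# Rule of thumb from the scaling-law literature: training compute is roughly
# 6 FLOPs per parameter per token (about 2 for the forward pass, 4 for the backward).
n_tokens = 300e9  # illustrative training-set size in tokens
print(f"~{6 * approx_params * n_tokens:.1e} training FLOPs")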
The decision often comes down to:
- The available compute budget, memory, and training infrastructure.
- Desired inference characteristics, such as latency and throughput.
- How stably the model trains at the chosen depth, and what the specific task and dataset reward empirically.
When using a framework like PyTorch, depth and width are typically controlled by specific parameters during model initialization. Consider the nn.TransformerEncoder:
import torch
import torch.nn as nn
# Example Parameters
vocab_size = 30000
d_model = 512 # Model width (embedding dimension)
nhead = 8 # Number of attention heads (related to width)
num_encoder_layers = 6 # Model depth
dim_feedforward = 2048 # Width of the FFN intermediate layer
dropout = 0.1
# Input Embedding
embedding = nn.Embedding(vocab_size, d_model)
# Transformer Encoder Layer definition
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=nhead,
    dim_feedforward=dim_feedforward,
    dropout=dropout,
    batch_first=True  # Use batch_first=True for clarity
)

# Stacking the layers to create depth
transformer_encoder = nn.TransformerEncoder(
    encoder_layer=encoder_layer,
    num_layers=num_encoder_layers
)
# Example Usage (requires positional encoding, masking etc.)
# src = torch.randint(0, vocab_size, (32, 100)) # Batch x Sequence Length
# embedded_src = embedding(src)
# # ... add positional encoding ...
# output = transformer_encoder(embedded_src) # Pass through the encoder stack
# print(output.shape) # torch.Size([32, 100, 512])
In this snippet:
- num_encoder_layers directly controls the depth.
- d_model, nhead, and dim_feedforward control the width.

Experimenting with these values is fundamental to scaling Transformers. You might compare a model with num_encoder_layers=12, d_model=768 against one with num_encoder_layers=24, d_model=512, keeping factors like total parameter count or computational cost roughly comparable, to understand the trade-offs empirically.
Consider two hypothetical models with roughly similar parameter counts but different architectures: a deeper, narrower model (A) tends to have lower memory requirements per layer but a longer sequential computation path, while a shallower, wider model (B) shows the opposite trend, with heavier individual layers but fewer sequential steps. The short sketch below makes this concrete by building both configurations and counting their parameters.
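As a sketch, the snippet below uses the 24-layer × 512 and 12-layer × 768 configurations suggested earlier as stand-ins for models A and B (the head counts and feed-forward sizes are illustrative choices, not prescribed values), builds both with nn.TransformerEncoder, and compares their parameter counts.

import torch.nn as nn

def build_encoder(num_layers, d_model, nhead, dim_feedforward):
    """Build an encoder stack with the given depth and width settings."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=nhead,
        dim_feedforward=dim_feedforward,
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Model A: deeper and narrower. Model B: shallower and wider.
model_a = build_encoder(num_layers=24, d_model=512, nhead=8, dim_feedforward=2048)
model_b = build_encoder(num_layers=12, d_model=768, nhead=12, dim_feedforward=3072)

for name, model in [("A (24 x 512)", model_a), ("B (12 x 768)", model_b)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Model {name}: {n_params / 1e6:.1f}M parameters")

Keeping the two counts in the same ballpark lets you attribute differences in training behavior or latency to the depth/width split rather than to raw capacity.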
In summary, choosing between depth and width is a balancing act. While deeper models offer the potential for sophisticated hierarchical learning, they can increase sequential computation time and pose optimization hurdles. Wider models provide more capacity per layer and may parallelize better internally but come with significantly higher memory and compute costs per layer. Modern large models typically scale both dimensions, guided by empirical results from scaling law research and constrained by the available hardware and training infrastructure. Careful experimentation is often required to find the most effective configuration for your specific goals.