The feed-forward network (FFN) in a standard Transformer architecture, also known as a position-wise feed-forward network, is a simple yet computationally significant component. It typically consists of two linear transformations with a non-linear activation function in between, applied independently at each position. This FFN sublayer is where a substantial portion of the model's parameters and floating-point operations (FLOPs) are concentrated, making it the primary candidate for replacement with a Mixture of Experts (MoE) layer.
The goal is to substitute the dense FFN with a sparse MoE layer, creating a model with a much larger parameter count but a comparable computational cost for a single forward pass. This substitution is not just a theoretical exercise; it is a direct architectural modification designed to increase model capacity efficiently.
The FFN in a Transformer block acts as a modular unit. It accepts an input tensor of shape (batch_size, sequence_length, d_model) and produces an output tensor of the same shape. The MoE layer is engineered to be a "drop-in" replacement, meaning it must respect the same input and output tensor dimensions.
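In code, such a dense FFN is only a few lines. The sketch below is illustrative: it assumes a ReLU activation and an expansion width d_ff (commonly set to 4 * d_model); the actual FeedForward module used later in this section may differ in those details.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """A minimal position-wise FFN: expand, apply a non-linearity, project back."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: d_model -> d_ff
            nn.ReLU(),                  # non-linearity (GELU is also common)
            nn.Linear(d_ff, d_model),   # project back: d_ff -> d_model
        )

    def forward(self, x):
        # x: (batch_size, sequence_length, d_model) -> same shape
        return self.net(x)
```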
The diagram below illustrates this architectural change. On the left is a standard Transformer block, and on the right is a block where the FFN has been replaced by an MoE layer.
The substitution of a dense Feed-Forward Network with a sparse Mixture of Experts layer. The core attention mechanism and residual connections remain unchanged. The MoE layer internally contains a gating network and a set of independent expert networks.
While the external interface remains the same, the internal computation changes significantly. Instead of every token passing through the same set of weights in the FFN, the gating network within the MoE layer directs each token to a small subset of experts (often just one or two).
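To make the internals concrete, here is a minimal sketch of what such an MoE layer might look like. It is illustrative only: it assumes a linear router with softmax top-k selection, a simplified Switch-style load-balancing auxiliary loss, and experts built from the FeedForward sketch above with an assumed hidden width of 4 * d_model. A production implementation (capacity factors, batched dispatch, expert parallelism) is considerably more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative MoE layer: a linear router plus num_experts independent FFN experts."""
    def __init__(self, d_model, num_experts, top_k, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model                      # assumed expansion width
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # the gating network
        self.experts = nn.ModuleList(
            [FeedForward(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                         # (num_tokens, d_model)

        # Gating: score every expert, keep only the top-k per token.
        logits = self.router(tokens)                            # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize

        # Dispatch: each expert processes only the tokens routed to it.
        output = torch.zeros_like(tokens)
        for e in range(self.num_experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = self.experts[e](tokens[token_ids])
            output[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert_out

        # Simplified load-balancing auxiliary loss: penalize the product of the
        # fraction of tokens whose top choice is each expert and the mean
        # router probability assigned to that expert.
        dispatch_frac = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
        mean_prob = probs.mean(0)
        aux_loss = self.num_experts * (dispatch_frac * mean_prob).sum()

        return output.reshape(batch, seq, d_model), aux_loss
```

The important thing for what follows is the interface: the layer consumes a tensor of shape (batch_size, sequence_length, d_model) and returns an output of the same shape together with a scalar auxiliary loss.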
Let's examine how this change appears at the level of a full Transformer block. A typical block implemented in a framework like PyTorch might look like this:
```python
# A standard Transformer block
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)  # The dense FFN
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sublayer
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward sublayer
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
```
To integrate an MoE layer, we replace the FeedForward module with our MoELayer. A critical difference emerges in the forward pass: the MoE layer must also return its auxiliary loss, which is required for training to ensure balanced routing.
```python
# A Transformer block with an MoE layer
class MoETransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, num_experts, top_k, dropout):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # The FFN is replaced by the MoE layer
        self.moe_layer = MoELayer(d_model, num_experts, top_k)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sublayer (remains the same)
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        # MoE sublayer
        # The MoE layer returns the output and an auxiliary loss
        moe_output, aux_loss = self.moe_layer(x)
        x = self.norm2(x + self.dropout(moe_output))
        # The auxiliary loss must be passed up to the training loop
        return x, aux_loss
```
Notice the return signature of the forward method has changed from x to x, aux_loss. This change propagates up through the model's main forward method. The main training loop must be designed to collect these auxiliary losses from each MoE block, sum them, and add them to the primary loss function (e.g., cross-entropy) before backpropagation.
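A training step might handle this along the following lines. This is a sketch, not a prescribed recipe: it assumes the model's own forward method returns the logits together with a list of per-block auxiliary losses, and it uses a hypothetical aux_loss_weight hyperparameter (small values such as 0.01 are typical in the literature).

```python
# Sketch of one training step. Assumes torch / F are imported as above and that
# `model`, `optimizer`, and `dataloader` are defined elsewhere in the script.
aux_loss_weight = 0.01  # hypothetical scaling factor for the balancing loss

for inputs, targets in dataloader:
    # model(...) is assumed to return (logits, aux_losses), with one auxiliary
    # loss per MoE block in the stack.
    logits, aux_losses = model(inputs)

    # Primary task loss, e.g. next-token cross-entropy.
    primary_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1)
    )

    # Sum the balancing losses from every MoE block and add them, scaled.
    total_aux_loss = torch.stack(aux_losses).sum()
    loss = primary_loss + aux_loss_weight * total_aux_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```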
The substitution fundamentally alters the model's resource profile. Consider a Transformer where the standard FFN has 1 billion parameters. Replacing it with an MoE layer containing 8 experts, each identical in size to the original FFN, results in a model with approximately 8 billion parameters in that layer alone.
However, if the gating mechanism is set to top_k=1, the FLOPs per token remain almost the same as in the original dense model. Each token is processed by only one expert, so the computation is equivalent to a single FFN pass, plus the small overhead of the gating network, as the following comparison summarizes.
| Attribute | Standard FFN | MoE Layer (N=8, k=1) |
|---|---|---|
| Parameters | ~1 billion (example) | ~8 billion (all 8 experts stored) |
| FLOPs per Token | One dense FFN pass | Roughly one FFN pass (a single expert), plus a small gating overhead |
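A quick back-of-the-envelope check makes these numbers concrete. The figures below simply restate the example above (a 1-billion-parameter FFN, 8 experts, top-1 routing); the small gating network is ignored.

```python
# Rough parameter and per-token compute comparison for the example above.
ffn_params = 1_000_000_000      # parameters in the dense FFN
num_experts, top_k = 8, 1

dense_params = ffn_params
moe_params = num_experts * ffn_params        # every expert must be stored
print(f"Dense FFN parameters: {dense_params:,}")   # 1,000,000,000
print(f"MoE layer parameters: {moe_params:,}")     # 8,000,000,000

# Per-token compute scales with the parameters a token actually touches.
dense_active = ffn_params
moe_active = top_k * ffn_params              # only top_k experts run per token
print(f"Active parameters per token (dense): {dense_active:,}")  # 1,000,000,000
print(f"Active parameters per token (MoE):   {moe_active:,}")    # 1,000,000,000
```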
This decoupling is the core benefit. We achieve a dramatic increase in the model's parameter count, which is a proxy for its knowledge capacity, without a corresponding increase in the computational cost of inference or training. This makes it possible to train models with trillions of parameters on existing hardware, a feat that would be impossible with a dense architecture. The primary new challenge becomes managing the memory required to store all experts, which is addressed by parallelism strategies discussed in later chapters.