The feed-forward network (FFN) in a standard Transformer architecture, also known as a position-wise feed-forward network, is a simple yet computationally significant component. It typically consists of two linear transformations with a non-linear activation function in between, applied independently at each position. This FFN sublayer is where a substantial portion of the model's parameters and floating-point operations (FLOPs) are concentrated, making it the primary candidate for replacement with a Mixture of Experts (MoE) layer.
The goal is to substitute the dense FFN with a sparse MoE layer, creating a model with a much larger parameter count but a comparable computational cost for a single forward pass. This substitution is not just a theoretical exercise; it is a direct architectural modification designed to increase model capacity efficiently.
The FFN in a Transformer block acts as a modular unit. It accepts an input tensor of shape (batch_size, sequence_length, d_model) and produces an output tensor of the same shape. The MoE layer is engineered to be a "drop-in" replacement, meaning it must respect the same input and output tensor dimensions.
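In code, such a dense FFN is only a few lines. The sketch below is illustrative: it assumes a ReLU activation and an expansion width d_ff (commonly set to 4 * d_model); the actual FeedForward module used later in this section may differ in those details.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """A minimal position-wise FFN: expand, apply a non-linearity, project back."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: d_model -> d_ff
            nn.ReLU(),                  # non-linearity (GELU is also common)
            nn.Linear(d_ff, d_model),   # project back: d_ff -> d_model
        )

    def forward(self, x):
        # x: (batch_size, sequence_length, d_model) -> same shape
        return self.net(x)
```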
The diagram below illustrates this architectural change. On the left is a standard Transformer block, and on the right is a block where the FFN has been replaced by an MoE layer.
The substitution of a dense Feed-Forward Network with a sparse Mixture of Experts layer. The core attention mechanism and residual connections remain unchanged. The MoE layer internally contains a gating network and a set of independent expert networks.
While the external interface remains the same, the internal computation changes significantly. Instead of every token passing through the same set of weights in the FFN, the gating network within the MoE layer directs each token to a small subset of experts (often just one or two).
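To make the internals concrete, here is a minimal sketch of what such an MoE layer might look like. It is illustrative only: it assumes a linear router with softmax top-k selection, a simplified Switch-style load-balancing auxiliary loss, and experts built from the FeedForward sketch above with an assumed hidden width of 4 * d_model. A production implementation (capacity factors, batched dispatch, expert parallelism) is considerably more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative MoE layer: a linear router plus num_experts independent FFN experts."""
    def __init__(self, d_model, num_experts, top_k, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model                      # assumed expansion width
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # the gating network
        self.experts = nn.ModuleList(
            [FeedForward(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                         # (num_tokens, d_model)

        # Gating: score every expert, keep only the top-k per token.
        logits = self.router(tokens)                            # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize

        # Dispatch: each expert processes only the tokens routed to it.
        output = torch.zeros_like(tokens)
        for e in range(self.num_experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = self.experts[e](tokens[token_ids])
            output[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert_out

        # Simplified load-balancing auxiliary loss: penalize the product of the
        # fraction of tokens whose top choice is each expert and the mean
        # router probability assigned to that expert.
        dispatch_frac = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
        mean_prob = probs.mean(0)
        aux_loss = self.num_experts * (dispatch_frac * mean_prob).sum()

        return output.reshape(batch, seq, d_model), aux_loss
```

The important thing for what follows is the interface: the layer consumes a tensor of shape (batch_size, sequence_length, d_model) and returns an output of the same shape together with a scalar auxiliary loss.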
Let's examine how this change appears at the level of a full Transformer block. A typical block implemented in a framework like PyTorch might look like this:
```python
# A standard Transformer block
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)  # The dense FFN
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sublayer
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward sublayer
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
```
To integrate an MoE layer, we replace the FeedForward module with our MoELayer. A critical difference emerges in the forward pass: the MoE layer must also return its auxiliary loss, which is required for training to ensure balanced routing.
```python
# A Transformer block with an MoE layer
class MoETransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, num_experts, top_k, dropout):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # The FFN is replaced by the MoE layer
        self.moe_layer = MoELayer(d_model, num_experts, top_k)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sublayer (remains the same)
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        # MoE sublayer
        # The MoE layer returns the output and an auxiliary loss
        moe_output, aux_loss = self.moe_layer(x)
        x = self.norm2(x + self.dropout(moe_output))
        # The auxiliary loss must be passed up to the training loop
        return x, aux_loss
```
Notice the return signature of the forward method has changed from x to x, aux_loss. This change propagates up through the model's main forward method. The main training loop must be designed to collect these auxiliary losses from each MoE block, sum them, and add them to the primary loss function (e.g., cross-entropy) before backpropagation.
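A training step might handle this along the following lines. This is a sketch, not a prescribed recipe: it assumes the model's own forward method returns the logits together with a list of per-block auxiliary losses, and it uses a hypothetical aux_loss_weight hyperparameter (small values such as 0.01 are typical in the literature).

```python
# Sketch of one training step. Assumes torch / F are imported as above and that
# `model`, `optimizer`, and `dataloader` are defined elsewhere in the script.
aux_loss_weight = 0.01  # hypothetical scaling factor for the balancing loss

for inputs, targets in dataloader:
    # model(...) is assumed to return (logits, aux_losses), with one auxiliary
    # loss per MoE block in the stack.
    logits, aux_losses = model(inputs)

    # Primary task loss, e.g. next-token cross-entropy.
    primary_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1)
    )

    # Sum the balancing losses from every MoE block and add them, scaled.
    total_aux_loss = torch.stack(aux_losses).sum()
    loss = primary_loss + aux_loss_weight * total_aux_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```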
The substitution fundamentally alters the model's resource profile. Consider a Transformer where the standard FFN has 1 billion parameters. Replacing it with an MoE layer containing 8 experts, each identical in size to the original FFN, results in a model with approximately 8 billion parameters in that layer alone.
However, if the gating mechanism is set to top_k=1, the FLOPs per token remain almost the same as in the original dense model. Each token is processed by only one expert, so the computation is equivalent to a single FFN pass, plus the small overhead of the gating network, as the following comparison summarizes.
| Attribute | Standard FFN | MoE Layer (N=8, k=1) |
|---|---|---|
| Parameters | ~1 billion (example) | ~8 billion (all 8 experts stored) |
| FLOPs per Token | One dense FFN pass | Roughly one FFN pass (a single expert), plus a small gating overhead |
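A quick back-of-the-envelope check makes these numbers concrete. The figures below simply restate the example above (a 1-billion-parameter FFN, 8 experts, top-1 routing); the small gating network is ignored.

```python
# Rough parameter and per-token compute comparison for the example above.
ffn_params = 1_000_000_000      # parameters in the dense FFN
num_experts, top_k = 8, 1

dense_params = ffn_params
moe_params = num_experts * ffn_params        # every expert must be stored
print(f"Dense FFN parameters: {dense_params:,}")   # 1,000,000,000
print(f"MoE layer parameters: {moe_params:,}")     # 8,000,000,000

# Per-token compute scales with the parameters a token actually touches.
dense_active = ffn_params
moe_active = top_k * ffn_params              # only top_k experts run per token
print(f"Active parameters per token (dense): {dense_active:,}")  # 1,000,000,000
print(f"Active parameters per token (MoE):   {moe_active:,}")    # 1,000,000,000
```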
This decoupling is the core benefit. We achieve a dramatic increase in the model's parameter count, which is a proxy for its knowledge capacity, without a corresponding increase in the computational cost of inference or training. This makes it possible to train models with trillions of parameters on existing hardware, a feat that would be impossible with a dense architecture. The primary new challenge becomes managing the memory required to store all experts, which is addressed by parallelism strategies discussed in later chapters.