At the heart of a Mixture of Experts model are the expert networks themselves. While the gating network acts as a traffic controller, the experts are the specialized destinations where the actual heavy computation occurs. In a standard Transformer, a single, large feed-forward network (FFN) processes every token. An MoE layer replaces this monolithic FFN with a collection of smaller, parallel FFNs, each termed an "expert."
An individual expert is architecturally simple. In most modern MoE implementations, such as those used in large language models, an expert is just a standard two-layer feed-forward network. It consists of:
An up-projection linear layer that expands the token representation from the model dimension to a larger intermediate dimension (d_model to d_ff).
A non-linear activation function, such as GELU or ReLU.
A down-projection linear layer that maps the representation back from d_ff to d_model.
This structure is identical to the FFN block found in a standard Transformer layer. The difference lies not in the complexity of a single expert, but in having many of them.
The internal architecture of a typical expert network. It mirrors the standard FFN block found in Transformer models.
The primary motivation for using experts is to decouple the growth in model parameters from the growth in computational cost (measured in FLOPs). Consider a standard Transformer with an FFN block. If you want to increase its capacity, you must make the FFN's hidden layer larger. This increases both the parameter count and the FLOPs for every single token that passes through it.
MoE models offer a more efficient scaling path. Instead of one massive FFN, you might use 8, 16, or even 64 experts, each having the same size as the original FFN.
Let's analyze the trade-off with a concrete example. Suppose a token is routed to only k=2 experts out of a total of N=8:
Parameters: the layer stores the weights of all 8 experts, so its parameter count is roughly 8 times that of the original dense FFN.
Computation: each token passes through only 2 of those experts, so the FLOPs per token are roughly 2 times those of a single FFN, plus the small cost of the gating network itself.
This is the core advantage of sparse models: you can build a model with a massive number of parameters, but each input only activates a small, computationally manageable fraction of them. The model is "large" in terms of stored knowledge but "small" in terms of active computation for any given forward pass.
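To make the arithmetic explicit, here is a back-of-the-envelope sketch of this trade-off. The numbers (d_model=512, d_ff=2048, N=8, k=2) are illustrative, biases are ignored, and the gating network's small overhead is left out.

# Illustrative parameter vs. compute comparison for a dense FFN and an MoE layer
d_model, d_ff = 512, 2048
N, k = 8, 2

params_per_expert = 2 * d_model * d_ff           # two weight matrices (biases ignored)
dense_params = params_per_expert                 # a single dense FFN of the same size
moe_params = N * params_per_expert               # all N experts are stored in memory

dense_flops_per_token = 2 * params_per_expert    # ~2 FLOPs per weight (multiply + add)
moe_flops_per_token = k * dense_flops_per_token  # only k experts run for each token

print(f"Stored parameters: dense {dense_params:,} vs MoE {moe_params:,} ({N}x)")
print(f"FLOPs per token:   dense {dense_flops_per_token:,} vs MoE {moe_flops_per_token:,} ({k}x)")

The parameter count grows with the total number of experts N, while the per-token compute grows only with k, which is exactly the decoupling described above.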
Experts are not pre-programmed with specific functions. Their specialization is an emergent property of the training process, driven entirely by the gating network's routing decisions. As the model trains, the gating network learns to route tokens with similar characteristics to the same set of experts. The optimization process, guided by the overall task loss and the auxiliary load-balancing loss, encourages this behavior.
Over time, distinct patterns can be observed: some experts consistently receive tokens from particular domains, such as programming syntax or scientific terminology, while others handle more general or common token types.
This learned division of labor allows the model to dedicate capacity to different facets of the data. Instead of a single FFN needing to be a jack-of-all-trades, each expert can become a master of a specific domain.
The gating network learns to route tokens to experts that have specialized for certain types of data, such as programming syntax or scientific terms.
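As a hypothetical illustration of how such patterns could be inspected, the sketch below assumes you already have a per-token expert index produced by a router (routers are covered in later sections) and simply tallies how many tokens land on each expert. Random indices stand in for real routing decisions here.

import torch

# Hypothetical inspection sketch: assume a router has produced a top-1 expert
# index for every token in a batch. Random indices are a stand-in for real routing.
num_experts = 8
expert_indices = torch.randint(0, num_experts, (4, 128))  # (batch, seq_len)

# Count how many tokens each expert received in this batch.
tokens_per_expert = torch.bincount(expert_indices.flatten(), minlength=num_experts)
print(tokens_per_expert)  # with a trained router, these counts reveal specialization patterns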
To make this more concrete, here is what a single expert looks like in PyTorch. It is a small nn.Module wrapping an nn.Sequential that contains the linear layers and activation function we described.
import torch.nn as nn


class Expert(nn.Module):
    """
    A simple feed-forward network to be used as an expert in an MoE layer.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # up-projection: d_model -> d_ff
            nn.GELU(),                  # non-linear activation
            nn.Linear(d_ff, d_model)    # down-projection: d_ff -> d_model
        )

    def forward(self, x):
        return self.net(x)


# Example instantiation
# d_model = 512 (embedding dimension)
# d_ff = 2048 (intermediate feed-forward dimension)
expert_network = Expert(d_model=512, d_ff=2048)
print(expert_network)
Printing the instantiated module shows its structure:
Expert(
  (net): Sequential(
    (0): Linear(in_features=512, out_features=2048, bias=True)
    (1): GELU(approximate='none')
    (2): Linear(in_features=2048, out_features=512, bias=True)
  )
)
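As a quick sanity check, the expert maps an input of shape (batch, seq_len, d_model) back to the same shape, so its output can later be combined with the outputs of other experts. A minimal usage example:

import torch

# The expert preserves the model dimension, operating on the last axis of the input.
x = torch.randn(4, 128, 512)   # (batch, seq_len, d_model)
y = expert_network(x)
print(y.shape)                 # torch.Size([4, 128, 512])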
This simple, reusable module forms the building block for the powerful scaling properties of MoE architectures. In the following sections, we will see how this Expert module is combined with a gating network to form a complete MoE layer.