The gating network is the control center of a Mixture of Experts (MoE) layer. Its sole responsibility is to inspect each incoming token and decide which of the available experts are best suited to process it. This routing decision is not static; it is a learned function that adapts during training, allowing the model to develop sophisticated, data-driven pathways for information. Unlike a simple switch, the gating network produces a set of continuous-valued scores that can be interpreted as its confidence in assigning a token to each expert.

## Architecture and Mathematical Formulation

At its core, the gating network is a simple feed-forward network, typically just a single linear layer applied to the token's input embedding, followed by a softmax function. This design keeps the routing mechanism computationally lightweight, which matters because it is executed for every token in every MoE layer.

Consider a single input token represented by a vector $x \in \mathbb{R}^d$, where $d$ is the model's hidden dimension. The gating network has a trainable weight matrix $W_g \in \mathbb{R}^{d \times N}$, where $N$ is the total number of experts in the layer.

The first step is to compute the logits, or raw scores, for each expert by projecting the input token onto the gating weights:

$$ H = x \cdot W_g $$

The result, $H$, is a vector of length $N$, where each element $H_i$ represents the raw score for assigning the token $x$ to expert $i$. To convert these scores into a probability distribution, we apply the softmax function:

$$ g(x)_i = \text{softmax}(H)_i = \frac{\exp(H_i)}{\sum_{j=1}^{N} \exp(H_j)} $$

The output vector, $g(x)$, contains the gating weights. Each component $g(x)_i$ is a value between 0 and 1, and the components sum to 1. This vector represents a soft assignment of the token across all experts.

```dot
digraph G {
  rankdir=TB;
  splines=ortho;
  node [shape=box, style="rounded,filled", fontname="Arial", margin="0.2,0.1"];
  edge [fontname="Arial", fontsize=10];

  subgraph cluster_input {
    label="Input Token";
    style=filled;
    color="#e9ecef";
    x [label="Token Embedding (x)", shape=box, style="rounded,filled", fillcolor="#a5d8ff"];
  }

  subgraph cluster_gating {
    label="Gating Network";
    style=filled;
    color="#e9ecef";
    gating_logic [label="Linear Layer (W_g)\n+ Softmax", shape=box, style="rounded,filled", fillcolor="#d0bfff"];
  }

  subgraph cluster_experts {
    label="Expert Networks";
    style=filled;
    color="#e9ecef";
    node [shape=box, style="rounded,filled", fillcolor="#96f2d7"];
    E1 [label="Expert 1"];
    E2 [label="Expert 2"];
    E_dots [label="...", shape=none, fillcolor=none];
    EN [label="Expert N"];
  }

  output [label="Gating Weights g(x)", shape=box, style="rounded,filled", fillcolor="#ffd8a8"];

  x -> gating_logic;
  gating_logic -> output [label=" Scores per expert"];
  {rank=same; E1; E2; E_dots; EN;}
  output -> E1 [label=" g(x)₁", style=dashed, color="#495057"];
  output -> E2 [label=" g(x)₂", style=dashed, color="#495057"];
  output -> EN [label=" g(x)ₙ", style=dashed, color="#495057"];
}
```

The gating network processes an input token embedding to produce a vector of weights, one for each expert.
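As a concrete illustration, here is a minimal PyTorch sketch of this gating computation. The class name `GatingNetwork` and the dimensions in the usage example are illustrative choices, not part of any particular model.

```python
import torch
import torch.nn as nn


class GatingNetwork(nn.Module):
    """Minimal gating network: one linear projection followed by a softmax.

    Illustrative sketch; d_model and num_experts correspond to d and N
    in the formulas above.
    """

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # W_g in R^{d x N}: projects a token embedding onto one logit per expert.
        self.w_g = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) or (batch, seq_len, d_model)
        logits = self.w_g(x)                  # H = x · W_g, shape (..., N)
        return torch.softmax(logits, dim=-1)  # g(x): each row sums to 1


# Usage: route a batch of 4 tokens with hidden size 16 across 8 experts.
gate = GatingNetwork(d_model=16, num_experts=8)
tokens = torch.randn(4, 16)
weights = gate(tokens)          # shape (4, 8)
print(weights.sum(dim=-1))      # approximately tensor([1., 1., 1., 1.])
```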
## Enforcing Sparsity with Top-k Routing

While the softmax function produces a dense vector of probabilities, a primary goal of MoE is sparse computation: only a fraction of the experts should be activated for any given token. To enforce this, we apply a top-k selection mechanism. Instead of using all $N$ experts, the gating network selects the $k$ experts with the highest scores in the logits vector $H$. The value of $k$ is a critical hyperparameter, often set to 1 or 2.

For example, with $k=2$, the router identifies the two experts with the highest logits. All other experts are ignored for this specific token, ensuring that the computational cost is proportional to $k$ rather than to the total number of experts $N$.

This selection introduces a challenge: the top-k operation is non-differentiable, which complicates training via backpropagation. In practice, gradients are passed only through the connections to the chosen top-k experts. The gating weights for the selected experts are then used to scale their outputs, and a common approach is to re-normalize these $k$ weights so they sum to 1. The final output of the MoE layer for input $x$ is the weighted sum of the outputs from the selected experts (a code sketch tying these routing and weighting steps together appears at the end of this section):

$$ y(x) = \sum_{i \in \text{TopK}(H)} \frac{\exp(H_i)}{\sum_{j \in \text{TopK}(H)} \exp(H_j)} \cdot E_i(x) $$

## Training the Router

The gating network's weight matrix, $W_g$, is not fixed; it is trained jointly with the rest of the model. The overall training loss backpropagates through the selected $k$ experts and, importantly, back to the gating network itself.

This end-to-end training process teaches the router its function. If routing a certain type of token to a particular expert consistently reduces the model's loss, the gradients update $W_g$ to increase the probability of that assignment in the future. This feedback loop is what drives the experts to specialize. The gating network learns to identify features in the token embeddings that predict which expert will be most effective, effectively becoming a learned traffic controller that optimizes the flow of information through the model's specialized pathways.

However, this process is not without its own challenges. A naive training setup can lead to imbalanced routing, where the gating network favors a small number of experts and leaves the others undertrained. This issue is addressed by auxiliary losses, which we will examine in the "Load Balancing and Auxiliary Losses" section.
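To close out the section, here is a minimal sketch of a full sparse MoE forward pass that combines the gating projection, top-k selection, and re-normalized weighting from the $y(x)$ formula. It assumes each expert is a small two-layer feed-forward block and uses $k=2$; the class name `SparseMoELayer` and the per-expert loop are illustrative rather than a performance-oriented implementation.

```python
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    """Sketch of a top-k MoE layer: gating, top-k routing, weighted combination.

    Assumptions: each expert is a two-layer feed-forward block, and the
    selected gating weights are re-normalized with a softmax over the
    top-k logits, matching the y(x) formula above.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # gating weights W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.w_g(x)                                        # H, shape (T, N)
        topk_logits, topk_idx = torch.topk(logits, self.k, dim=-1)  # (T, k) each
        # Re-normalize over the selected experts only: softmax of the top-k logits.
        topk_weights = torch.softmax(topk_logits, dim=-1)           # (T, k)

        output = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                # Tokens whose slot-th choice is expert e.
                mask = topk_idx[:, slot] == e
                if mask.any():
                    output[mask] += topk_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return output


# Usage: 4 tokens, hidden size 16, 8 experts, top-2 routing.
moe = SparseMoELayer(d_model=16, d_ff=32, num_experts=8, k=2)
y = moe(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 16])
```

Because the top-k weights are computed from the gating logits, calling `.backward()` on the model's loss produces gradients for `w_g` alongside the expert parameters; the joint training of the router described above falls out of ordinary backpropagation.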