Routing mechanisms such as top-k and switch gating perform a "hard" assignment: a token is routed to a small, discrete set of experts, while all other experts are ignored for that token's computation. This hard selection is the source of the computational savings in sparse models, but it also introduces challenges such as non-differentiability and the need for auxiliary load-balancing losses.
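For contrast, here is a minimal PyTorch sketch of a switch-style (top-1) gate. The class name, shapes, and parameters are illustrative assumptions, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchGate(nn.Module):
    """Illustrative switch-style (top-1) router: a hard, discrete expert choice per token."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)     # (num_tokens, num_experts)
        weight, expert_idx = probs.max(dim=-1)      # hard choice: one expert per token
        # Only the chosen expert runs for each token; the rest are skipped,
        # which is where the computational savings of sparse MoE come from.
        return expert_idx, weight
```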
Soft MoE offers a different approach by replacing this discrete selection with a "soft," weighted combination of all experts. Instead of choosing which experts to use, the gating network determines a weight for every expert, and the final output is a weighted sum of the outputs from all experts. This makes the entire MoE layer fully differentiable and elegantly sidesteps the training instabilities associated with hard gating.
In a Soft MoE layer, the gating network operates similarly to a standard router by producing a logit for each expert. However, instead of using these logits to select the top-k experts, we apply a softmax function across them. This converts the logits into a set of positive weights that sum to one, effectively forming a probability distribution over the experts.
The final output for an input token $x$ is not the output of a few selected experts, but a linear combination of the outputs from all $N$ experts. The contribution of each expert $E_i(x)$ is scaled by its corresponding softmax weight $w_i$.

The mathematical formulation is direct. Given an input $x$, the gating network $G$ computes a vector of logits $h(x)$. The weights $w$ are then calculated as:
$$
w = \mathrm{softmax}(h(x))
$$

The final output $y$ of the Soft MoE layer is the weighted sum:

$$
y = \sum_{i=1}^{N} w_i \cdot E_i(x)
$$

This formulation might look familiar. It closely resembles the attention mechanism, where a query attends to a set of keys to produce weights, which are then used to compute a weighted sum of values. In Soft MoE, you can think of the token's representation as the query and the experts as the keys and values.
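This formulation translates almost line for line into code. The sketch below is a minimal PyTorch version, assuming simple two-layer MLP experts and a single linear gate (all names and shapes are illustrative): it computes the softmax weights and the weighted sum over all experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoELayer(nn.Module):
    """Illustrative Soft MoE layer: all experts run; outputs are blended by softmax weights."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # produces the logits h(x)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        w = F.softmax(self.gate(x), dim=-1)          # (num_tokens, N), rows sum to 1
        expert_outs = torch.stack(
            [expert(x) for expert in self.experts],  # every expert processes every token
            dim=1,                                   # (num_tokens, N, d_model)
        )
        # y = sum_i w_i * E_i(x): fully differentiable, but compute scales with N.
        return (w.unsqueeze(-1) * expert_outs).sum(dim=1)

# Example usage with arbitrary sizes
layer = SoftMoELayer(d_model=512, d_hidden=2048, num_experts=8)
y = layer(torch.randn(16, 512))   # 16 tokens in, 16 tokens out
```

The loop over `self.experts` makes the trade-off explicit: running every expert on every token is exactly what keeps the layer fully differentiable, and exactly what makes it expensive.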
The diagrams below illustrate the difference between the data flow in a hard-gating MoE and a Soft MoE.
In hard routing, the gating network selects a discrete expert (Expert 1), and all computation flows through it. Other experts remain inactive for this token.
In Soft MoE, the gating network computes a weight for every expert. The final output is a weighted combination of all expert outputs.
The primary advantage of Soft MoE is that it resolves the training challenges of sparse models: the layer is fully differentiable end to end, gradients flow to every expert on every step, and there is no need for auxiliary load-balancing losses to keep a discrete router healthy.

However, this elegance comes at a significant and often prohibitive cost: every expert must process every token, so the computation per token scales with the total number of experts N rather than with the small constant k of a sparse router. With 64 experts and a top-2 baseline, for instance, a Soft MoE layer performs roughly 32 times the expert computation per token.
Given its computational demands, a "pure" Soft MoE is rarely used in large-scale language models where computational efficiency is a primary design goal. Its formulation serves more as a theoretical benchmark and an analytical tool.
However, the core idea of soft, differentiable assignments has influenced the design of more practical, hybrid systems. For example, some approaches might use a top-k router to select a small subset of experts and then compute a soft, weighted combination within that subset. This can provide some of the training stability of soft routing while retaining most of the computational benefits of sparsity.
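A minimal sketch of that hybrid pattern, reusing the same illustrative MLP experts as in the earlier example, might look like this: the gate makes a hard top-k selection, and the combination within the selected subset is a soft, softmax-weighted sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSoftCombine(nn.Module):
    """Illustrative hybrid: hard top-k selection, soft weighting within the selected subset."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                  # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)    # hard selection of k experts
        w = F.softmax(topk_logits, dim=-1)                     # soft weights within the subset
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                            # expert chosen in this slot
            for e in idx.unique().tolist():
                mask = idx == e                                # tokens routed to expert e
                out[mask] += w[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Only the k selected experts run for each token, but within that subset the output is a differentiable softmax-weighted sum, mirroring the Soft MoE formulation on a reduced expert set.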
Understanding Soft MoE is important because it clearly delineates the trade-off between mathematical simplicity in training and the computational sparsity required for scaling. It represents one end of the spectrum in MoE design, where training stability is maximized at the expense of inference efficiency. This provides a valuable contrast to mechanisms like Switch Transformers, which occupy the other end of the spectrum by prioritizing computational efficiency above all else.