While learned gating networks offer dynamic routing, they introduce their own set of complexities, including trainable parameters, potential instability, and the need for auxiliary loss functions. An alternative approach sidesteps these issues by removing the learning process from the router. Hash-based routing is a deterministic method that assigns tokens to experts using a fixed, non-learned function, trading the potential for intelligent specialization for absolute simplicity and stability.
At its core, hash-based routing is simple. Instead of passing a token's representation through a linear layer to compute routing logits, we apply a standard hash function to some property of the token. The resulting hash value is then mapped to an expert index using the modulo operator.
The process for a single token $x$ is:

$$\text{expert\_index} = \text{hash}(x) \bmod E$$

where $E$ is the number of experts and $\text{hash}(\cdot)$ is a fixed, non-learned hash function.
This operation is computationally trivial and requires no trainable weights. The same input token feature will always map to the same expert, making the routing decision static and predictable throughout training and inference.
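As a concrete illustration, here is a minimal sketch of this rule in PyTorch. The `hash_route` helper, the choice of MD5 as the hash function, and the sample token IDs are illustrative assumptions, not a reference implementation.

```python
import hashlib

import torch

def hash_route(token_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Map each token ID to an expert index with a fixed hash and modulo."""
    indices = []
    for tid in token_ids.tolist():
        # A standard, non-learned hash of the token ID. MD5 is an arbitrary
        # but reproducible choice: the same ID always yields the same digest,
        # so the routing decision is deterministic across training and inference.
        digest = hashlib.md5(str(tid).encode()).hexdigest()
        indices.append(int(digest, 16) % num_experts)
    return torch.tensor(indices, dtype=torch.long)

# Route a small batch of (hypothetical) token IDs to 4 experts.
token_ids = torch.tensor([101, 2054, 2003, 1996, 3007, 1997, 2605, 1029])
print(hash_route(token_ids, num_experts=4))  # identical output on every run
```

Note that there is no linear layer, no softmax, and no gradient flowing through the routing decision; the entire router is a lookup rule.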
A diagram of the hash-based routing workflow. A token's feature is passed through a hash function and a modulo operator to deterministically select an expert.
Opting for a non-learned router might seem counterintuitive, as it discards the model's ability to learn intelligent routing patterns. However, this approach offers significant engineering advantages.
The most obvious benefit is the complete elimination of the gating network. This means:

- No trainable routing parameters to learn or store.
- Negligible routing computation, since a hash and a modulo replace a matrix multiplication.
- Fully deterministic, reproducible expert assignments during both training and inference.
A well-chosen hash function naturally distributes tokens uniformly across the available experts. This statistical uniformity means that, over a large batch of tokens, each expert is expected to receive a similar number of assignments.
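This uniformity is easy to verify empirically. The sketch below (reusing the hypothetical hash-and-modulo rule from the earlier example) counts how many of 100,000 synthetic token IDs land on each of 8 experts; the counts should cluster near the ideal 12,500 per expert.

```python
import hashlib
from collections import Counter

def expert_of(token_id: int, num_experts: int) -> int:
    # Same fixed hash-and-modulo rule as in the sketch above.
    digest = hashlib.md5(str(token_id).encode()).hexdigest()
    return int(digest, 16) % num_experts

num_experts = 8
counts = Counter(expert_of(tid, num_experts) for tid in range(100_000))
for expert in range(num_experts):
    # Each expert should receive close to 100_000 / 8 = 12_500 tokens.
    print(f"expert {expert}: {counts[expert]} tokens")
```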
This property directly resolves the load imbalance problem that plagues learned routers. Consequently, the auxiliary load-balancing loss is no longer necessary. The total loss function simplifies to just the primary task loss (e.g., cross-entropy):
$$L_{\text{total}} = L_{\text{task}}$$

By removing the auxiliary loss and its associated weighting hyperparameter, we also eliminate a common source of training instability, such as the issues that the router z-loss, discussed in the previous chapter, was introduced to address.
The primary drawback of hash-based routing is its impact on model performance. Learned gating allows the model to develop semantic specialization; for example, one expert might become adept at handling punctuation and grammar, while another focuses on scientific terminology. The router learns to send the right token to the right specialist.
Hash-based routing breaks this connection. Since assignments are pseudo-random, each expert is forced to become a generalist. It must learn to process a random subset of the entire data distribution, preventing the emergence of fine-grained specialization. This typically results in lower model quality (e.g., higher perplexity or lower accuracy) compared to a sparse MoE model with a well-trained learned router.
The chart below illustrates this trade-off. While hash-based MoE provides a better performance-to-computation ratio than a dense model, it generally underperforms a standard MoE with learned routing.
Comparison of model quality for dense, learned MoE, and hash-based MoE models. Hash-based routing offers a middle ground, improving on dense models but not matching the performance of learned, specialized routing.
In practice, hash-based routing serves as an important experimental baseline. It helps isolate the performance gains that come from learned, sparse activation versus those that come simply from having more parameters. If a complex, learned MoE model cannot outperform a hash-based equivalent, it often indicates problems with the training setup or the routing mechanism itself.