The gating network in a Mixture of Experts model dynamically routes each token to a small number of experts. This data-dependent routing creates an engineering challenge: while the routing is flexible, GPU computation is most efficient when operating on tensors with fixed, predictable shapes. If one expert is suddenly assigned 1,000 tokens and another only 10, we cannot simply allocate different amounts of memory and compute on the fly within a single batch.
To resolve this, MoE implementations pre-allocate a fixed-size buffer, or "capacity," for each expert. Every expert is provisioned to handle a specific maximum number of tokens per batch. The capacity factor is the hyperparameter that controls the size of this buffer. It directly dictates the trade-off between computational waste and information loss.
The capacity for each expert is calculated relative to a perfectly uniform distribution of tokens. If you have $T$ tokens in a batch and $N$ experts, a perfectly balanced system would send $T/N$ tokens to each expert. The capacity factor, $C$, is a multiplier applied to this ideal average.
The formula for an expert's token capacity is:

$$\text{expert capacity} = \frac{T}{N} \times C$$

A capacity factor of $C = 1.0$ means each expert's buffer can hold exactly the average number of tokens. A factor of $C = 1.25$ provides a 25% buffer, allowing each expert to process up to 25% more tokens than the batch average.
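In code, this is a one-line calculation. A minimal sketch (the function name and example sizes are illustrative):

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Fixed buffer size, in tokens, allocated to each expert."""
    # Ideal average load if tokens were spread perfectly evenly,
    # scaled by the capacity factor; implementations typically round
    # up to a whole number of token slots.
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

# 4096 tokens across 8 experts: a factor of 1.25 gives each expert
# room for 640 tokens instead of the ideal average of 512.
print(expert_capacity(4096, 8, 1.25))  # 640
```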
The choice of capacity factor creates a direct and unavoidable trade-off. Because routing is dynamic, some experts will inevitably be more popular than others for a given batch. This leads to two opposing outcomes.
Low Capacity Factor (e.g., $C = 1.0$): If an expert's buffer is too small, it will overflow when it receives more tokens than it can handle. Tokens that arrive after the buffer is full are "dropped": they are not processed by any expert and instead pass through the MoE layer unchanged via its residual connection. This saves computation, but at the cost of model performance, as the model loses the opportunity to apply specialized knowledge to those tokens.
High Capacity Factor (e.g., $C = 2.0$): A large buffer greatly reduces the risk of dropping tokens. However, it forces the system to allocate memory and compute for capacity that is rarely used. If an expert is allocated a capacity of 200 tokens but receives only 50, the computation and memory for the 150 unused slots are wasted. This padding increases training cost and slows down execution.
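To make the two failure modes concrete, here is a toy top-1 routing loop in NumPy. It assumes a hypothetical router has already chosen one expert per token; real implementations vectorize this with scatter operations, but the bookkeeping is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, capacity = 12, 3, 5

# Simulated router decisions: the chosen expert for each token.
assignments = rng.integers(0, num_experts, size=num_tokens)

slots_used = np.zeros(num_experts, dtype=int)
dropped = []
for token_id, expert in enumerate(assignments):
    if slots_used[expert] < capacity:
        slots_used[expert] += 1   # token occupies a buffer slot
    else:
        dropped.append(token_id)  # buffer full: token is dropped and
                                  # bypasses the expert via the residual path

padding = capacity * num_experts - slots_used.sum()
print(f"dropped: {len(dropped)} tokens, wasted: {padding} slots")
```

With a capacity factor of at least 1.0, total slots exceed total tokens, so any batch skewed enough to overflow one expert necessarily leaves empty slots at another. That is exactly the tension the capacity factor tunes.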
The diagram below illustrates this process. Input tokens are routed to three experts, each with a fixed capacity. Expert 2 overflows, causing one token to be dropped. Expert 3 is underutilized, resulting in wasted (padded) capacity.
An illustration of token routing with a fixed capacity. Expert 2 receives three tokens but can only process two, leading to a dropped token. Expert 3 receives only one token, leaving one slot as wasted computation.
The relationship between the capacity factor, dropped tokens, and wasted computation is clear when plotted. As you increase the capacity factor, the percentage of dropped tokens falls sharply, but the amount of wasted computation rises linearly.
The capacity factor directly controls the balance between model quality (fewer dropped tokens) and computational efficiency (less waste).
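You can reproduce the shape of this trade-off with a quick Monte Carlo sketch; the skew of the simulated routing distribution below is an assumption, not measured data:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts = 4096, 8

# Assumed, moderately skewed expert popularity.
expert_probs = rng.dirichlet(np.full(num_experts, 2.0))

for cf in (1.0, 1.25, 1.5, 2.0):
    capacity = int(np.ceil(num_tokens / num_experts * cf))
    counts = rng.multinomial(num_tokens, expert_probs)
    dropped = np.maximum(counts - capacity, 0).sum()
    used = np.minimum(counts, capacity).sum()
    slots = capacity * num_experts
    print(f"C={cf:4.2f}  dropped={100 * dropped / num_tokens:5.2f}%  "
          f"padded={100 * (slots - used) / slots:5.2f}%")
```

As the factor grows, the dropped percentage collapses toward zero while the padded fraction climbs.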
There is no single "best" value for the capacity factor; it is a critical hyperparameter that depends on your specific model, data, and training objectives. However, here are some common practices:
Start with a Common Baseline: Most research and production systems use a capacity factor between 1.25 and 2.0. A value of $C = 1.25$ is a reasonable starting point for many applications.
Monitor Dropped Tokens: During training, log the percentage of tokens dropped per batch (the sketch after this list shows one way to compute it). A consistently high percentage (e.g., above 1%) is a signal that your model's learning is being impaired and that the capacity factor should be increased.
Profile for Wasted Compute: If your dropped token rate is nearly zero but your training is slow, your capacity factor may be unnecessarily high. Profiling tools can help you measure the amount of padding, which represents wasted computation. Reducing the capacity factor can improve throughput.
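A minimal per-batch metrics helper, assuming you can read out how many tokens the router sent to each expert (the names are illustrative):

```python
import numpy as np

def moe_batch_stats(counts: np.ndarray, capacity: int) -> dict:
    """Routing health metrics from per-expert token counts for one batch."""
    total = counts.sum()
    slots = capacity * len(counts)
    dropped = np.maximum(counts - capacity, 0).sum()
    used = np.minimum(counts, capacity).sum()
    return {
        "drop_rate": dropped / total,            # keep well below ~1%
        "padding_rate": (slots - used) / slots,  # fraction of wasted compute
    }

# Example: 8 experts, capacity 640, one overloaded expert.
counts = np.array([700, 610, 590, 505, 480, 450, 430, 331])
print(moe_batch_stats(counts, capacity=640))
# drop_rate ~1.5%, padding_rate ~21%
```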
The capacity factor does not operate in a vacuum. Its effectiveness is tightly connected to the auxiliary load-balancing loss discussed in Chapter 1. A well-designed load-balancing loss encourages the gating network to distribute tokens more evenly across the experts, and the more uniform the routing, the smaller the capacity factor you can use without excessive token dropping.
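For reference, here is a NumPy sketch of one widely used formulation, the Switch Transformer auxiliary loss (the coefficient that scales it against the main loss is omitted):

```python
import numpy as np

def load_balancing_loss(router_probs: np.ndarray,
                        assignments: np.ndarray) -> float:
    """Switch Transformer-style auxiliary load-balancing loss.

    router_probs: (tokens, experts) softmax outputs of the gate.
    assignments:  (tokens,) expert index each token was dispatched to.
    """
    num_experts = router_probs.shape[1]
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    # P_i: mean router probability mass on expert i.
    p = router_probs.mean(axis=0)
    # num_experts * sum(f_i * P_i); minimized when routing is uniform.
    return num_experts * float(f @ p)
```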
Ultimately, managing the capacity factor is an important skill in training MoE models. It requires careful tuning and monitoring to find the right balance that maximizes model quality while respecting your computational budget.