The preceding chapters established the mechanics of individual Mixture of Experts layers. We now address their practical application by incorporating them into modern neural network architectures. The most common use case for MoEs is to increase model capacity without a proportional rise in computational cost, and this is typically achieved by replacing standard feed-forward networks (FFNs) with MoE layers.
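To make the FFN-to-MoE swap concrete before getting into the details, the sketch below shows a pre-norm Transformer block whose feed-forward sub-layer has been replaced by a small top-k MoE. This is a minimal PyTorch illustration, not code from any particular library: the Expert, SimpleMoE, and TransformerBlock classes, the layer sizes, and the naive per-expert loop are all assumptions chosen for readability rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard position-wise FFN, used here as a single expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    """Drop-in replacement for the FFN sub-layer: routes each token to k of N experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                        # x: (batch, seq, d_model)
        scores = self.gate(x)                    # (batch, seq, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        # Naive loop: every expert sees every token, then routing weights zero out
        # the non-selected ones. Real implementations dispatch only routed tokens.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                    # (batch, seq, k) boolean
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # combine weight for expert e
                out = out + w * expert(x)
        return out

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block with the FFN sub-layer swapped for an MoE layer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # A standard block would place Expert(d_model, d_ff) here; we use an MoE instead.
        self.moe = SimpleMoE(d_model, d_ff, num_experts, k)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.moe(self.norm2(x))
        return x
```

A production implementation would replace the per-expert loop with a batched dispatch step so that each expert processes only the tokens routed to it, which is what makes the computation genuinely sparse.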
This chapter provides a technical guide to this integration process, covering the topics listed in the sections below.
A central benefit of MoE is the decoupling of total parameter count from the computation required for a single forward pass. An MoE model may contain N experts, but for any given token the gating network routes it to a small subset of k experts, where k ≪ N. The computation spent on each token is therefore a function of k, while the model's total parameter count is a function of N. This relationship can be expressed as:
Total Parameters ∝ N × Parameters per Expert
Computational FLOPs ∝ k × FLOPs per Expert

The chapter concludes with a hands-on exercise where you will modify a standard Transformer implementation to use sparse MoE layers, putting these architectural principles into practice.
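As a rough numerical illustration of these proportionalities, the snippet below compares a single dense FFN against an MoE layer with N = 8 experts and k = 2 active experts per token. The hidden sizes are hypothetical and the gating overhead is ignored; the point is only that parameters scale with N while per-token FLOPs scale with k.

```python
# Back-of-the-envelope comparison for one FFN/MoE sub-layer (illustrative sizes only).
d_model, d_ff = 4096, 16384        # hypothetical hidden sizes
N, k = 8, 2                        # total experts vs. experts active per token

params_per_expert = 2 * d_model * d_ff               # two weight matrices, biases ignored
flops_per_expert_per_token = 2 * params_per_expert   # ~2 FLOPs per weight (multiply + add)

dense_params = params_per_expert                      # a single standard FFN
moe_params = N * params_per_expert                    # grows with N
moe_flops_per_token = k * flops_per_expert_per_token  # grows only with k

print(f"Dense FFN params:        {dense_params / 1e6:.0f}M")
print(f"MoE params (N={N}):       {moe_params / 1e6:.0f}M")      # ~8x the parameters
print(f"MoE FLOPs/token (k={k}):  {moe_flops_per_token / 1e9:.1f} GFLOPs")  # only ~2x a dense FFN
```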
5.1 Replacing FFNs with MoE Layers in Transformers
5.2 Placement of MoE Layers: Frequency and Location
5.3 MoE in Vision Transformers (ViT)
5.4 MoE in Multi-modal Models
5.5 Architectural Variants and their Properties
5.6 Analyzing Parameter vs. FLOPs Trade-offs
5.7 Practice: Modifying a Transformer to use MoE