Mixture of Experts: Sparse Gating, Switch Transformer, and Efficient Scaling
Sparse MoE routes each token to the top-k of N expert FFN layers; Switch Transformer (Fedus et al., 2022) uses k=1 routing to scale to 1.6T parameters, activating ~7B per token, and achieves a 7× pre-training speedup over a dense T5-Base baseline.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Switch Transformer total parameters | 1.6 | trillion | Fedus et al. (2022); each token activates ~7B parameters via top-1 routing |
| Switch Transformer parameters activated per token | ~7 | billion | Only 1 expert per MoE layer activated; ~0.4% of total parameters per token |
| Pre-training speedup (Switch-Base vs dense T5-Base) | 7× | faster | Same compute budget; Switch-Base reaches equivalent perplexity in 7× fewer steps (71K vs 500K) |
| Typical top-k routing | k = 1 or 2 | experts per token | k=1 (Switch); k=2 (GShard, most other MoE); k>2 shows diminishing returns |
| Expert capacity factor | 1.0–1.5 | ratio | Maximum tokens per expert = capacity_factor × (tokens / n_experts); overflow tokens skip the MoE layer |
Mixture of Experts (MoE) is a conditional computation technique that scales model capacity without proportionally scaling per-token compute. By activating only a fraction of model parameters for each input token, MoE layers enable enormously large models to be trained on the same compute budget as smaller dense models.
Architecture
A standard transformer FFN layer has fixed parameters applied to every token. An MoE layer replaces it with:
- N expert FFN networks: E₁, E₂, …, E_N — each identical in structure to a standard FFN
- A gating network G(x) that produces routing weights over experts
- Top-k selection: for each token, activate the k experts with the highest gate values
The output of the MoE layer for token x:
MoE(x) = Σᵢ∈top-k G(x)ᵢ · Eᵢ(x)
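The gating and combination above can be sketched in plain NumPy. This is a minimal single-device illustration, not an optimized implementation: the parameter names (`W_gate`, `W_in`, `W_out`) and all sizes are invented for the example, and each expert is a two-layer ReLU FFN, matching the structure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 4, 2

# Hypothetical parameters: each expert Eᵢ is a two-layer FFN.
W_in = rng.standard_normal((n_experts, d_model, 4 * d_model)) * 0.02
W_out = rng.standard_normal((n_experts, 4 * d_model, d_model)) * 0.02
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02  # gating network G

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Route each token to its top-k experts; combine gate-weighted outputs."""
    gates = softmax(x @ W_gate)                 # (tokens, n_experts)
    topk = np.argsort(gates, axis=-1)[:, -k:]   # indices of the k largest gates
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for i in topk[t]:                       # only k of N experts run
            h = np.maximum(x[t] @ W_in[i], 0.0)  # expert FFN, ReLU activation
            out[t] += gates[t, i] * (h @ W_out[i])
    return out

x = rng.standard_normal((8, d_model))   # 8 tokens
y = moe_forward(x)                      # y.shape == (8, 16)
```

With k=2 of 4 experts, each token's compute touches only half the expert parameters, while the layer's total capacity is 4× a single FFN.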
Switch Transformer vs Dense Baselines
Fedus et al. (2022) benchmarked Switch Transformer (k=1 routing) against dense T5 models on C4 pre-training:
| Model | Parameters | Active Params/Token | Steps to -1.90 NLL | Relative Speed |
|---|---|---|---|---|
| T5-Base (dense) | 223M | 223M | 500K | 1× |
| T5-Large (dense) | 739M | 739M | 500K | — |
| T5-11B (dense) | 11B | 11B | 500K | — |
| Switch-Base (128 experts) | 7.4B | ~223M | 71K | 7× vs T5-Base |
| Switch-XXL (128 experts) | 395B | ~4.7B | 250K | — |
| Switch-C (2048 experts) | 1.6T | ~7B | — | 4× vs T5-11B |
Routing Strategies
| Method | k | Load Balancing | Key Paper |
|---|---|---|---|
| Sparsely-Gated MoE | 2 | Auxiliary loss + noise | Shazeer et al. (2017) |
| Switch Transformer | 1 | Auxiliary loss | Fedus et al. (2022) |
| GShard | 2 | Local group dispatch | Lepikhin et al. (2021) |
| Expert Choice | k per expert | Naturally balanced | Zhou et al. (2022) |
Expert Capacity and Token Overflow
Expert capacity defines the maximum number of tokens routed to each expert:
capacity = capacity_factor × (total_tokens / n_experts)
With capacity_factor=1.25 (a 25% buffer above perfectly uniform load), most tokens are processed normally. Overflow tokens (those routed to an expert that has already reached capacity) skip the MoE layer and pass through unchanged via the residual connection. This design keeps tensor shapes static and prevents overloaded experts from bottlenecking training, at the cost of some tokens not benefiting from the MoE layer.
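A minimal sketch of the capacity formula and overflow handling, assuming top-1 routing and position-in-batch priority for deciding which tokens overflow (real implementations differ in how they prioritize tokens; the function names are illustrative):

```python
import math
import numpy as np

def expert_capacity(total_tokens, n_experts, capacity_factor=1.25):
    # capacity = capacity_factor × (total_tokens / n_experts), rounded up
    return math.ceil(capacity_factor * total_tokens / n_experts)

def dispatch_with_capacity(assignments, n_experts, capacity):
    """Mark which tokens are processed; the rest overflow (residual only)."""
    counts = np.zeros(n_experts, dtype=int)
    processed = np.zeros(len(assignments), dtype=bool)
    for t, e in enumerate(assignments):      # earlier tokens get priority
        if counts[e] < capacity:
            counts[e] += 1
            processed[t] = True
    return processed

cap = expert_capacity(total_tokens=1024, n_experts=8)  # ceil(1.25 * 128) = 160
```

If routing were perfectly uniform, each of the 8 experts would receive 128 tokens; the 1.25 factor leaves headroom for 160 before any token overflows.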
Related Pages
See feed-forward-layers for the dense FFN that MoE replaces, and scaling-laws for how MoE interacts with compute-optimal training principles.
Sources
- Fedus et al. (2022) — Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022
- Shazeer et al. (2017) — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017
- Lepikhin et al. (2021) — GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021
- Zhou et al. (2022) — Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022
Frequently Asked Questions
How does sparse MoE differ from a standard transformer FFN layer?
A standard transformer FFN applies the same learned weight matrix to every token. A sparse MoE layer has N expert FFN networks (each identical in structure to a standard FFN) and a gating network that routes each token to the top-k experts. Only the selected experts' parameters are used for each token. If N=64 experts and k=2, each token activates 2/64 ≈ 3% of the expert parameters, giving the model large total capacity while keeping per-token compute nearly the same as a single expert.
What is load balancing in MoE models and why does it matter?
Load balancing ensures tokens are distributed roughly evenly across experts. Without it, the gating network tends to collapse, repeatedly sending most tokens to the same few experts and leaving the rest undertrained. Switch Transformer addresses this with an auxiliary load-balancing loss, L_aux = α · N · Σᵢ f_i · p_i, where N is the number of experts, f_i is the fraction of tokens routed to expert i, and p_i is the average gating probability for expert i. The loss is minimized (at value α) when routing is uniform, encouraging balanced expert usage during training.
How are MoE models trained in practice?
MoE layers are distributed across multiple accelerators, with each device hosting a subset of experts. During the forward pass, tokens are dispatched across devices to their assigned experts (all-to-all communication), processed locally, then sent back (second all-to-all). The computational cost per token is similar to a standard FFN, but model capacity is multiplied by N. GShard (Lepikhin et al.) and Switch Transformer demonstrate this approach at trillion-parameter scale.
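The dispatch/combine pattern can be simulated on a single device, with permutations standing in for the two all-to-all exchanges. The per-expert "computation" here is a simple scale factor rather than an FFN, purely to keep the round-trip verifiable; all names are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
tokens, d_model, n_experts = 8, 4, 2

x = rng.standard_normal((tokens, d_model))
assign = rng.integers(0, n_experts, size=tokens)  # top-1 expert per token

# "Dispatch": permute tokens so each expert's tokens are contiguous,
# mimicking the first all-to-all exchange across devices.
order = np.argsort(assign, kind="stable")
dispatched = x[order]

# Each (simulated) device applies its local expert; here expert i just
# multiplies by expert_scale[i] as a stand-in for its FFN.
expert_scale = np.array([1.0, -1.0])
processed = dispatched * expert_scale[assign[order], None]

# "Combine": the inverse permutation restores original token order,
# mimicking the second all-to-all exchange.
combined = np.empty_like(processed)
combined[order] = processed
```

Sorting by expert index is the single-device analogue of grouping tokens into per-destination send buffers; in a real implementation each device only materializes its own experts' slices.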