Mixture of Experts: Sparse Gating, Switch Transformer, and Efficient Scaling
Sparse MoE routes each token to the top-k of N expert FFN layers; Switch Transformer (Fedus et al., 2022) uses k=1 routing to scale to 1.6T parameters, activating ~7B per token, and achieves a 7× pre-training speedup over a dense T5-Base baseline.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Switch Transformer total parameters | 1.6 | trillion | Fedus et al. (2022); each token activates ~7B parameters via top-1 routing |
| Switch Transformer parameters activated per token | ~7 | billion | Only 1 expert per MoE layer activated; ~0.4% of total parameters per token |
| Pre-training speedup (Switch-Base vs dense T5-Base) | 7× | faster | Same compute budget; Switch-Base reaches equivalent perplexity in 7× fewer steps (71K vs 500K) |
| Typical top-k routing | k = 1 or 2 | experts per token | k=1 (Switch); k=2 (GShard, most other MoE); k>2 shows diminishing returns |
| Expert capacity factor | 1.0–1.5 | ratio | Maximum tokens per expert = capacity_factor × (tokens / n_experts); overflow tokens skip the MoE layer |
Mixture of Experts (MoE) is a conditional computation technique that scales model capacity without proportionally scaling per-token compute. By activating only a fraction of model parameters for each input token, MoE layers enable enormously large models to be trained on the same compute budget as smaller dense models.
Architecture
A standard transformer FFN layer has fixed parameters applied to every token. An MoE layer replaces it with:
- N expert FFN networks: E₁, E₂, …, E_N — each identical in structure to a standard FFN
- A gating network G(x) that produces routing weights over experts
- Top-k selection: for each token, activate the k experts with the highest gate values
The output of the MoE layer for token x:
MoE(x) = Σᵢ∈top-k G(x)ᵢ · Eᵢ(x)
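The gating and combination above can be sketched in plain NumPy. This is a minimal single-device illustration, not an optimized implementation: the parameter names (`W_gate`, `W_in`, `W_out`) and all sizes are invented for the example, and each expert is a two-layer ReLU FFN, matching the structure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 4, 2

# Hypothetical parameters: each expert Eᵢ is a two-layer FFN.
W_in = rng.standard_normal((n_experts, d_model, 4 * d_model)) * 0.02
W_out = rng.standard_normal((n_experts, 4 * d_model, d_model)) * 0.02
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02  # gating network G

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Route each token to its top-k experts; combine gate-weighted outputs."""
    gates = softmax(x @ W_gate)                 # (tokens, n_experts)
    topk = np.argsort(gates, axis=-1)[:, -k:]   # indices of the k largest gates
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for i in topk[t]:                       # only k of N experts run
            h = np.maximum(x[t] @ W_in[i], 0.0)  # expert FFN, ReLU activation
            out[t] += gates[t, i] * (h @ W_out[i])
    return out

x = rng.standard_normal((8, d_model))   # 8 tokens
y = moe_forward(x)                      # y.shape == (8, 16)
```

With k=2 of 4 experts, each token's compute touches only half the expert parameters, while the layer's total capacity is 4× a single FFN.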
Switch Transformer vs Dense Baselines
Fedus et al. (2022) benchmarked Switch Transformer (k=1 routing) against dense T5 models on C4 pre-training:
| Model | Parameters | Active Params/Token | Steps to -1.90 NLL | Relative Speed |
|---|---|---|---|---|
| T5-Base (dense) | 223M | 223M | 500K | 1× |
| T5-Large (dense) | 739M | 739M | 500K | — |
| T5-11B (dense) | 11B | 11B | 500K | — |
| Switch-Base (128 experts) | 7.4B | ~223M | 71K | 7× vs T5-Base |
| Switch-XXL (128 experts) | 395B | ~4.7B | 250K | — |
| Switch-C (2048 experts) | 1.6T | ~7B | — | 4× vs T5-11B |
Routing Strategies
| Method | k | Load Balancing | Key Paper |
|---|---|---|---|
| Sparsely-Gated MoE | 2 | Auxiliary loss + noise | Shazeer et al. (2017) |
| Switch Transformer | 1 | Auxiliary loss | Fedus et al. (2022) |
| GShard | 2 | Local group dispatch | Lepikhin et al. (2021) |
| Expert Choice | k per expert | Naturally balanced | Zhou et al. (2022) |
Expert Capacity and Token Overflow
Expert capacity defines the maximum number of tokens routed to each expert:
capacity = capacity_factor × (total_tokens / n_experts)
With capacity_factor=1.25 (a 25% buffer above perfectly uniform load), most tokens are processed normally. Overflow tokens (those routed to an expert that has already reached capacity) skip the MoE layer and pass through unchanged via the residual connection. This design keeps tensor shapes static and prevents overloaded experts from bottlenecking training, at the cost of some tokens not benefiting from the MoE layer.
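A minimal sketch of the capacity formula and overflow handling, assuming top-1 routing and position-in-batch priority for deciding which tokens overflow (real implementations differ in how they prioritize tokens; the function names are illustrative):

```python
import math
import numpy as np

def expert_capacity(total_tokens, n_experts, capacity_factor=1.25):
    # capacity = capacity_factor × (total_tokens / n_experts), rounded up
    return math.ceil(capacity_factor * total_tokens / n_experts)

def dispatch_with_capacity(assignments, n_experts, capacity):
    """Mark which tokens are processed; the rest overflow (residual only)."""
    counts = np.zeros(n_experts, dtype=int)
    processed = np.zeros(len(assignments), dtype=bool)
    for t, e in enumerate(assignments):      # earlier tokens get priority
        if counts[e] < capacity:
            counts[e] += 1
            processed[t] = True
    return processed

cap = expert_capacity(total_tokens=1024, n_experts=8)  # ceil(1.25 * 128) = 160
```

If routing were perfectly uniform, each of the 8 experts would receive 128 tokens; the 1.25 factor leaves headroom for 160 before any token overflows.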
Related Pages
See feed-forward-layers for the dense FFN that MoE replaces, and scaling-laws for how MoE interacts with compute-optimal training principles.
Sources
- Fedus et al. (2022) — Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022
- Shazeer et al. (2017) — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017
- Lepikhin et al. (2021) — GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021
- Zhou et al. (2022) — Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022
Frequently Asked Questions
How does sparse MoE differ from a standard transformer FFN layer?
A standard transformer FFN applies the same learned weight matrix to every token. A sparse MoE layer has N expert FFN networks (each identical in structure to a standard FFN) and a gating network that routes each token to the top-k experts. Only the selected experts' parameters are used for each token. If N=64 experts and k=2, each token activates 2/64 ≈ 3% of the expert parameters, giving the model large total capacity while keeping per-token compute nearly the same as a single expert.
What is load balancing in MoE models and why does it matter?
Load balancing ensures tokens are distributed roughly evenly across experts. Without it, the gating network tends to collapse, repeatedly sending most tokens to the same few experts and leaving the rest undertrained. Switch Transformer addresses this with an auxiliary load-balancing loss, L_aux = α · N · Σᵢ f_i · p_i, where N is the number of experts, f_i is the fraction of tokens routed to expert i, and p_i is the average gating probability for expert i. The loss is minimized (at value α) when routing is uniform, encouraging balanced expert usage during training.
How are MoE models trained in practice?
MoE layers are distributed across multiple accelerators, with each device hosting a subset of experts. During the forward pass, tokens are dispatched across devices to their assigned experts (all-to-all communication), processed locally, then sent back (second all-to-all). The computational cost per token is similar to a standard FFN, but model capacity is multiplied by N. GShard (Lepikhin et al.) and Switch Transformer demonstrate this approach at trillion-parameter scale.
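The dispatch/combine pattern can be simulated on a single device, with permutations standing in for the two all-to-all exchanges. The per-expert "computation" here is a simple scale factor rather than an FFN, purely to keep the round-trip verifiable; all names are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
tokens, d_model, n_experts = 8, 4, 2

x = rng.standard_normal((tokens, d_model))
assign = rng.integers(0, n_experts, size=tokens)  # top-1 expert per token

# "Dispatch": permute tokens so each expert's tokens are contiguous,
# mimicking the first all-to-all exchange across devices.
order = np.argsort(assign, kind="stable")
dispatched = x[order]

# Each (simulated) device applies its local expert; here expert i just
# multiplies by expert_scale[i] as a stand-in for its FFN.
expert_scale = np.array([1.0, -1.0])
processed = dispatched * expert_scale[assign[order], None]

# "Combine": the inverse permutation restores original token order,
# mimicking the second all-to-all exchange.
combined = np.empty_like(processed)
combined[order] = processed
```

Sorting by expert index is the single-device analogue of grouping tokens into per-destination send buffers; in a real implementation each device only materializes its own experts' slices.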