An architecture that contains many "expert" sub-networks but routes each token through only a few of them, so a model can have huge total size while doing the compute of a much smaller one.
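In a Mixture-of-Experts (MoE) model, the feed-forward layers are replaced by sets of parallel "experts," each one a feed-forward network of its own. A small learned router scores the experts for each token, sends the token to the top-K of them (often just 2), and skips the rest. The chosen experts' outputs are then combined, weighted by the router's scores. Total capacity can therefore be very large while the work done per token stays small.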
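The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the eight "experts" are simple scaling functions, the `router_logits` values are made up for one token, and normalizing the scores with a softmax over only the selected experts is one common choice assumed here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: weights are normalized over the chosen experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the chosen experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


def make_expert(scale):
    """Hypothetical toy expert: multiplies every input value by `scale`."""
    return lambda x: [scale * v for v in x]


experts = [make_expert(s) for s in range(1, 9)]            # 8 toy experts
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]  # made-up scores

chosen = route_token(router_logits, k=2)
output = moe_layer([1.0, 2.0], experts, router_logits, k=2)
print("experts used:", [idx for idx, _ in chosen])
print("output:", output)
```

All eight experts exist in the `experts` list, but only the two selected ones are called for this token, which is the source of the compute saving.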
The payoff is efficiency: you get knowledge capacity close to that of a big model at roughly the per-token compute cost of a small one. For example, Mixtral 8x7B has ~47B total parameters but only ~13B active per token (about 28%), and GPT-OSS-120B has ~117B total with only ~5.1B active (about 4%).
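To make the ratio concrete, the short sketch below computes the share of parameters that are active per token, using the approximate figures quoted above. The `active_fraction` helper and the `models` table are illustrative names, not part of any library.

```python
# Approximate (total, active) parameter counts, taken from the text above.
models = {
    "Mixtral 8x7B": (47e9, 13e9),
    "GPT-OSS-120B": (117e9, 5.1e9),
}


def active_fraction(total_params, active_params):
    """Share of the model's parameters that do work for a single token."""
    return active_params / total_params


fractions = {name: active_fraction(total, active)
             for name, (total, active) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The smaller this fraction, the sparser the model: per-token compute tracks the active count, not the total.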
The catch for local users: all experts must still be loaded into memory even though only a few run per token. So an MoE runs at roughly the speed of its active size but needs VRAM/RAM for its total size. Check this before assuming a sparse model is cheap to host.
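A rough way to check the hosting cost is to estimate the memory taken by the weights alone from the total parameter count. The sketch below is a back-of-the-envelope calculation under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), treats 1 GB as 10^9 bytes, and assumes a flat number of bits per parameter, which real quantization formats only approximate. The function name is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


MIXTRAL_TOTAL = 47e9   # ~47B total parameters (from the text above)
MIXTRAL_ACTIVE = 13e9  # ~13B active per token (from the text above)

# What has to be loaded: every expert, so use the TOTAL count.
moe_16bit = weight_memory_gb(MIXTRAL_TOTAL, 16)
moe_4bit = weight_memory_gb(MIXTRAL_TOTAL, 4)

# For comparison only: a dense model the size of the ACTIVE count.
dense_16bit = weight_memory_gb(MIXTRAL_ACTIVE, 16)

print(f"MoE weights at 16-bit:       {moe_16bit:.1f} GB")
print(f"MoE weights at 4-bit:        {moe_4bit:.1f} GB")
print(f"Dense 13B weights at 16-bit: {dense_16bit:.1f} GB")
```

Under these assumptions the MoE needs several times the memory of a dense model with the same active size, even though the two do a similar amount of work per token.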
© 2026 Cynosure LLC.