
Mixture of Experts (MoE)

An architecture that contains many "expert" sub-networks but routes each token through only a few of them, so a model can have huge total size while doing the compute of a much smaller one.

Mixture of Experts, or MoE, is an architecture that has become central to the largest and most efficient open models, and it changes the usual rules about size and cost in a way that's important to understand before you plan hardware. In a normal (dense) model, every parameter is used for every token: run a 70B dense model and all 70 billion weights participate in producing each token. MoE breaks that assumption.

Here's how it works. In an MoE model, the feed-forward layers are split into many parallel sub-networks called experts (you might have 8, 16, 64, or more), while the attention layers and other shared parts stay dense. For each token, at each MoE layer, a small learned component called the router picks just the top few experts (2 in Mixtral 8x7B, for example) to actually process that token, and the rest sit idle for that token. Because only a handful of experts run at a time, the amount of computation per token stays small, even though the model's total capacity, the sum of all the experts plus the shared layers, can be enormous. Different experts tend to specialize, and the router learns which ones to consult for which inputs.
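The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the function names, the scalar "token", the eight toy experts, and the router scores are all made up for the example, and the weighting shown (a softmax over only the selected experts' scores) is one common variant among several.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_logits, top_k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    output = 0.0
    for index, weight in route_token(router_logits, top_k):
        output += weight * experts[index](token)  # idle experts are never called
    return output

# Toy setup (hypothetical): 8 experts, each just scales its input differently.
experts = [lambda x, scale=scale: x * scale for scale in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]  # made-up router scores

selection = route_token(logits, top_k=2)
print("experts used:", [index for index, _ in selection])
print("layer output:", moe_layer(1.0, experts, logits))
```

In a real model the token is a vector, each expert is a full feed-forward network, and the router scores come from a small learned layer, but the shape of the computation is the same: score every expert, run only the top few, and combine their outputs.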

The payoff is efficiency: you get the knowledge capacity of a very large model at roughly the compute cost of a much smaller one. This is why MoE models can pair large-model quality with small-model speed. A well-known example is Mixtral 8x7B, which has about 47B total parameters but only around 13B active per token, so it runs about as fast as a 13B model while carrying far more total knowledge. (The total is 47B rather than 8 × 7B = 56B because only the feed-forward experts are replicated; the rest of the network is shared.) OpenAI's gpt-oss-120b is another: roughly 117B total parameters with only about 5.1B active per token, giving it big-model breadth at small-model speed. DeepSeek's large models and several other flagship open releases use MoE for the same reason.

Now the catch, and it's the one that trips people up when planning local hardware. Even though only a few experts run per token, all of the experts must still be loaded into memory, because the router might pick any of them on the very next token. So an MoE model runs at roughly the speed of its active size but needs VRAM (or system RAM) for its total size. Mixtral 8x7B is about as quick as a 13B, but you must have room for all 47B of weights. gpt-oss-120b generates about as fast as a 5B but demands memory for the full 117B. This is the single most important thing to check before assuming a sparse model is cheap to host: fast does not mean small in memory.
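To put rough numbers on that, here is a back-of-the-envelope sketch of the memory the weights alone need, using the total parameter counts quoted above. The function name is made up for this example, and the result is only a floor: it ignores the KV cache, activations, and runtime overhead, and real quantized files carry some extra metadata, so actual requirements are higher.

```python
def weight_memory_gb(total_params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    bytes_per_weight = bits_per_weight / 8
    return total_params_billions * bytes_per_weight

# Total (not active) parameter counts, in billions, as quoted in the text.
for name, total_b in [("Mixtral 8x7B", 47), ("gpt-oss-120b", 117)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(total_b, bits)
        print(f"{name}: about {size:.1f} GB of weights at {bits}-bit")
```

Note that the active parameter count never appears in this calculation. Mixtral 8x7B at 16-bit comes out near 94 GB of weights even though only about 13B of them are used for any one token.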

The distinction between the two numbers is important enough to have its own explainer — see active vs total parameters — because the total figure drives your VRAM and download size while the active figure drives your speed and per-token cost. When you see a model marketed by its active size ("runs like a 5B!"), always find the total size for your memory math.
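As a small illustration of keeping the two figures apart, the sketch below stores both and derives one quantity from each. The class and field names are hypothetical, and the compute estimate uses the common rule of thumb of about 2 floating-point operations per active parameter per generated token, which is an approximation rather than a measured figure.

```python
from dataclasses import dataclass

@dataclass
class ModelSize:
    name: str
    total_b: float   # total parameters (billions): drives memory and download size
    active_b: float  # active parameters per token (billions): drives speed and cost

    @property
    def active_fraction(self):
        """Share of the weights actually used for any one token."""
        return self.active_b / self.total_b

    @property
    def approx_gflops_per_token(self):
        """Rule of thumb: ~2 FLOPs per active parameter per token (assumption)."""
        return 2 * self.active_b

models = [
    ModelSize("Mixtral 8x7B", total_b=47, active_b=13),
    ModelSize("gpt-oss-120b", total_b=117, active_b=5.1),
    ModelSize("dense 70B", total_b=70, active_b=70),  # dense: every weight is active
]

for m in models:
    print(f"{m.name}: {m.active_fraction:.0%} of weights active, "
          f"~{m.approx_gflops_per_token:.0f} GFLOPs per token, "
          f"{m.total_b:g}B parameters to store")
```

For a dense model the two figures are identical, which is why the distinction only starts to matter once a model is MoE.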

A practical consequence for local users is that MoE models are often a good fit for the "lots of RAM, modest GPU" situation. Because only a few experts are active per token, MoE models can run at a usable speed even when parts of them sit in slower system RAM rather than VRAM — the CPU offload penalty hurts less than it would for a dense model of the same total size, since far fewer weights are touched per token. This is one reason large MoE models became popular for enthusiasts running on high-RAM machines and Apple Silicon Macs with big unified memory. It doesn't change the memory requirement (you still need room for everything), but it softens the speed hit of not fitting entirely in VRAM.
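A rough way to see why the offload penalty is smaller: when weights are read from slow memory, generation speed is limited by how many bytes must be read per token. The sketch below compares a dense model and an MoE model of the same total size under that assumption. The function name and the 80 GB/s bandwidth figure are hypothetical, and the estimate is an upper bound that ignores compute time, KV cache reads, and any split between VRAM and system RAM.

```python
def max_tokens_per_second(active_params_billions, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token

RAM_BANDWIDTH_GB_PER_S = 80  # hypothetical system RAM bandwidth, for illustration only
BITS = 4                     # assume 4-bit quantized weights

# Same 47B total size; only the number of weights touched per token differs.
dense_speed = max_tokens_per_second(47, BITS, RAM_BANDWIDTH_GB_PER_S)
moe_speed = max_tokens_per_second(13, BITS, RAM_BANDWIDTH_GB_PER_S)

print(f"dense 47B:           up to ~{dense_speed:.1f} tokens/s")
print(f"MoE 47B, 13B active: up to ~{moe_speed:.1f} tokens/s")
print(f"speed ratio: ~{moe_speed / dense_speed:.1f}x")
```

Under these assumptions the ratio is simply total size divided by active size, which is the whole advantage in one number: both models occupy the same memory, but the MoE model reads far less of it per token.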

MoE also shapes how these models are quantized and served. Because so much of the parameter count lives in the experts, quantizing them well is where most of the memory savings come from, and serving engines like vLLM include specific optimizations for routing and expert parallelism at scale. When you're evaluating a large open flagship and it turns out to be MoE, the takeaways are consistent: expect small-model speed, plan for large-model memory, and check both the total and active figures before assuming anything about cost or fit.

Spanvero reports both figures where a model is MoE, and computes VRAM-to-run from the total size (because that's what has to fit) while reflecting the active size in speed and cost expectations. To see the real memory a specific MoE model needs and the honest cost across running it locally, on a rented GPU, or via your own API key, open /calculator/, or browse the large-model tier and MoE flagships under /models/.

Active vs total parameters · Parameters (the "B" / billions) · VRAM · Inference · Quantization · Local vs API vs renting a GPU
