
Active vs total parameters

In an MoE model, total parameters is everything that must be loaded into memory, while active parameters is the smaller subset actually used per token — the former drives memory, the latter drives speed and cost.

"47B total / 13B active." "117B total / 5.1B active." If you've looked at modern open models, you've seen these two-number specs, and understanding the difference between them is one of the highest-value things you can learn for planning what to run. The two numbers answer two different questions, and confusing them leads directly to buying the wrong hardware or being surprised by a model that won't load.

Start with the simple case. For a dense (non-MoE) model, total and active parameters are the same number — every parameter is used for every token, so there's only one figure to think about. A dense 8B model has 8B total and 8B active. The distinction only appears with Mixture-of-Experts models, which split their feed-forward layers into many experts and route each token through only a few of them (see the Mixture of Experts explainer for how the routing works).

For an MoE model, the two numbers mean genuinely different things:

Total parameters is the size of the entire model — all the experts added together. This is what determines how much memory you need, because all of the experts must be loaded into VRAM or RAM at once; the router can call on any of them at any moment. Total parameters also drives the download size and the disk footprint. When you're asking "will this fit on my GPU?", total parameters is the number to run your VRAM math on.
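As a rough illustration of that VRAM math, here is a minimal sketch that converts total parameters into memory for the weights alone. The function name `weight_memory_gb` and the bits-per-parameter values are assumptions made for this example. It ignores the KV cache, activations, and any per-block overhead a real quantization format adds, so treat the result as a floor, not a guarantee.

```python
def weight_memory_gb(total_params_billions: float, bits_per_param: float = 16.0) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration: excludes KV cache, activations,
    and quantization metadata, all of which add to the real requirement.
    """
    bytes_per_param = bits_per_param / 8.0
    # (billions of params) * (bytes per param) gives GB directly.
    return total_params_billions * bytes_per_param


# Total sizes taken from the two-number specs quoted in this article.
for name, total_b in [("Mixtral 8x7B", 47.0), ("gpt-oss-120b", 117.0)]:
    at_16_bit = weight_memory_gb(total_b, 16)
    at_4_bit = weight_memory_gb(total_b, 4)
    print(f"{name}: ~{at_16_bit:.0f} GB at 16-bit, ~{at_4_bit:.1f} GB at 4-bit (weights only)")
```

Note that the active parameter count never appears in this calculation: only the total decides whether the weights fit.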

Active parameters is the subset actually used to process a single token — the few experts the router selects, plus the always-on shared layers. This is what determines compute cost per token, which in turn sets the generation speed and, for hosted APIs, the per-token price. When you're asking "how fast will this be?" or "how much will each token cost?", active parameters is the number that matters.
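To put a number on per-token compute, a common rule of thumb is about 2 floating-point operations per active parameter per token for a forward pass. The sketch below applies that approximation. The function names are hypothetical, and the estimate leaves out attention's dependence on context length and the fact that real generation speed is often limited by memory bandwidth, so it is a proxy for relative cost, not a speed prediction.

```python
def forward_flops_per_token(active_params_billions: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params_billions * 1e9


def relative_compute(active_a_billions: float, active_b_billions: float) -> float:
    """How many times more per-token compute model A needs than model B."""
    return forward_flops_per_token(active_a_billions) / forward_flops_per_token(active_b_billions)


# Assumed comparison: a hypothetical dense 47B model (47B active)
# versus an MoE model with 13B active parameters.
ratio = relative_compute(47.0, 13.0)
print(f"Dense 47B needs about {ratio:.1f}x the per-token compute of 13B active")
```

Total parameters do not enter this estimate at all, which is the mirror image of the memory calculation.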

Concrete examples make it click. Mixtral 8x7B is "47B total / 13B active": it needs memory for 47B of weights but does roughly the per-token compute of a dense 13B model, so it's quick but hungry for VRAM. OpenAI's gpt-oss-120b is "117B total / 5.1B active": its per-token compute is close to that of a 5B model, which makes it remarkably nimble, yet you must have room for the full 117B of weights to run it at all. That gap between how fast it feels and how much memory it demands is the whole point of MoE, and the whole trap for the unwary.

The practical rule is short: when a model is advertised by its active size ("as fast as a 5B!"), always look up its total size before you plan hardware, because the total size is what has to fit. A model that "runs like a 5B" but needs 60+ GB of VRAM is not a small-GPU model, however fast it is once loaded. Both figures are objective specs — no quality judgment involved — and both are worth quoting.

Why does the architecture bother with this split at all? Because it decouples capacity from cost. Training and running a dense model that's genuinely as knowledgeable as a large MoE would cost far more per token, since every one of its parameters would fire on every token. MoE lets model builders scale up total capacity — and therefore breadth of knowledge and skill — while keeping the per-token compute (and thus the serving cost) low. That's the economic reason the largest and most capable open models increasingly use MoE: it's the most cost-effective way to be big.

For your own planning, remember that the total size also governs downloads and disk. Download size and storage footprint follow the same rule of roughly 2 GB per billion parameters at full precision (16-bit weights; less once quantized), applied to the total count. So an MoE model that's fast to run can still be a large, slow download that takes up substantial disk space; the "runs like a 5B" speed does nothing to shrink the file. Budget disk and bandwidth for the total size, budget VRAM for the total size, and budget time-per-token for the active size. Keeping those three straight prevents nearly every MoE surprise.
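The download side of that budget can be sketched the same way. The example below uses the roughly-2-GB-per-billion rule from the text and an assumed link speed of 100 Mbit/s, chosen only for illustration. Both function names are hypothetical, and real downloads will vary with the file format and server throughput.

```python
GB_PER_BILLION_AT_16_BIT = 2.0  # the rough rule quoted in the text


def download_size_gb(total_params_billions: float, bits_per_param: float = 16.0) -> float:
    """Approximate download/disk size in GB, scaled from the 16-bit rule."""
    return total_params_billions * GB_PER_BILLION_AT_16_BIT * (bits_per_param / 16.0)


def download_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Ideal transfer time in hours at a sustained link speed (no protocol overhead)."""
    total_bits = size_gb * 8e9               # GB -> bits
    seconds = total_bits / (link_mbit_per_s * 1e6)
    return seconds / 3600.0


# 117B total parameters at 16-bit over an assumed 100 Mbit/s connection.
size = download_size_gb(117.0)
print(f"~{size:.0f} GB, about {download_hours(size, 100.0):.1f} hours at 100 Mbit/s")
```

The 5.1B active figure plays no part here: the file you fetch and store is sized by the total.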

Spanvero surfaces both numbers for MoE models and, crucially, computes VRAM-to-run from the total size, because that's the honest constraint on whether you can run it. To see the real memory a specific model needs and the honest, $0-markup cost across local, rented-GPU, and your-own-API-key options, open /calculator/, and to filter models to what fits your card at their default quant, browse /models/24gb-vram/ or /models/48gb-vram/.

