
Inference

Actually running a trained model to get outputs — generating text, an image, or a transcription — as opposed to training it.

Inference is the "use" phase of a model. You give a trained model a prompt and it produces a response — a paragraph of text, an image, a transcription of audio. That's inference. It's the counterpart to training, which is the one-time, massively expensive process of creating the model's weights in the first place. The clean way to hold the two apart: training builds the model (done once, costs a fortune, needs a data center), while inference uses the model (done constantly, every single time anyone interacts with it). When you chat with a local model, generate an image, or call a hosted API, you are paying for inference — and because inference happens over and over, it's the cost that actually adds up in real use. It's also the cost Spanvero exists to make transparent.

What drives the cost and speed of inference? For text models, four things dominate: the model's active parameter count (more active parameters means more compute per token, which is where the mixture-of-experts (MoE) distinction between active and total parameters matters), the quantization level (lower precision runs faster and uses less memory, usually at some cost in output quality), the hardware you run on, and how many tokens are in the prompt and the response. Speed is usually reported in tokens per second, which is simply how fast the model emits its answer.
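
To make the first two factors concrete, here is a rough back-of-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, and it ignores KV cache and runtime overhead. The function names and the 47B-total / 13B-active model are hypothetical, chosen only for illustration.

```python
# Back-of-envelope estimates only; the model sizes below are hypothetical.

def weight_bytes(total_params: float, bits_per_weight: float) -> float:
    """Memory to hold the weights: every parameter is stored, active or not."""
    return total_params * bits_per_weight / 8


def flops_per_token(active_params: float) -> float:
    """Rule of thumb: roughly 2 FLOPs (multiply + add) per active parameter."""
    return 2.0 * active_params


if __name__ == "__main__":
    total, active = 47e9, 13e9  # hypothetical MoE: total vs active parameters
    for bits in (16, 8, 4):     # quantization levels to compare
        print(f"{bits}-bit weights: {weight_bytes(total, bits) / 1e9:.1f} GB")
    print(f"compute per token: {flops_per_token(active) / 1e9:.0f} GFLOPs")
```

The split mirrors the active-versus-total distinction: memory scales with total parameters and quantization, while per-token compute scales with active parameters.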

Text generation actually happens in two distinct phases, and knowing them explains a lot of observed behavior. The first is prefill: the model reads and processes your entire prompt at once. This is parallel and fast, but it's where a long prompt costs you — a huge input document takes real time and memory to ingest, and it's why the first token of a response can lag when the prompt is long. The second is decode: the model generates the output one token at a time, each new token depending on all the ones before it. Decode is inherently sequential, which is why output speed is measured per token and why generating a very long answer takes proportionally long. The KV cache is what keeps decode from being catastrophically slow — it stores the attention keys and values for tokens already processed so the model doesn't recompute the whole sequence for every new token.
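
The two phases and the KV cache can be sketched numerically. This is a simplified model under stated assumptions, not a benchmark: it treats prefill and decode as constant-rate phases, and it sizes the KV cache as one key and one value vector per layer, per KV head, per token. All names and example numbers (layer count, head sizes, throughput figures) are hypothetical.

```python
# Simplified two-phase latency model and KV-cache sizing; numbers are hypothetical.

def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds."""
    prefill_s = prompt_tokens / prefill_tps   # prompt is processed in parallel
    decode_s = output_tokens / decode_tps     # output is emitted one token at a time
    return prefill_s, prefill_s + decode_s


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor 2) for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    ttft, total = response_time(4000, 500, prefill_tps=2000.0, decode_tps=50.0)
    print(f"time to first token: {ttft:.1f} s, total: {total:.1f} s")
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"KV cache at 8192 tokens: {size / 2**30:.2f} GiB")
```

With these made-up rates, a 4,000-token prompt delays the first token by about 2 seconds, while the 500-token answer accounts for the remaining 10 seconds, which is the long-prompt versus long-answer trade-off described above.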

Inference is also where the practical choice of how to run a model plays out. The same model can be run locally on your own hardware (no per-token charge, just electricity and hardware you already own), on a rented cloud GPU you control, or through a hosted pay-per-token API. Which option is honestly cheapest depends entirely on your volume and situation. Serving engines exist specifically to make inference efficient at scale: vLLM, for instance, uses paged KV-cache management (PagedAttention) and continuous batching to serve many concurrent requests on the same GPU. For a single user at home, a friendly tool like Ollama or LM Studio handles inference just fine.
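
A rough way to compare the three routes for a given monthly volume is sketched below. Every price, wattage, and speed here is a made-up placeholder, not a quote from any provider, and the model is deliberately crude: it assumes the rented GPU is billed only while generating and that local hardware is already paid for.

```python
# Crude monthly cost comparison; all prices and speeds are hypothetical placeholders.

def api_cost(tokens: float, usd_per_million_tokens: float) -> float:
    """Hosted API: pay per token."""
    return tokens / 1e6 * usd_per_million_tokens


def rented_gpu_cost(tokens: float, tokens_per_second: float,
                    usd_per_hour: float) -> float:
    """Rented GPU: pay for the hours needed to generate the tokens."""
    hours = tokens / tokens_per_second / 3600
    return hours * usd_per_hour


def local_cost(tokens: float, tokens_per_second: float,
               watts: float, usd_per_kwh: float) -> float:
    """Local hardware: electricity only (hardware assumed already owned)."""
    hours = tokens / tokens_per_second / 3600
    return hours * (watts / 1000) * usd_per_kwh


if __name__ == "__main__":
    volume = 10e6  # tokens per month
    print(f"API:    ${api_cost(volume, 0.50):.2f}")
    print(f"Rented: ${rented_gpu_cost(volume, 50.0, 0.40):.2f}")
    print(f"Local:  ${local_cost(volume, 50.0, 300.0, 0.15):.2f}")
```

Changing the volume or the throughput shifts the ranking, which is the point: no single route is cheapest for every workload.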

Why does the training-versus-inference distinction matter so much for cost decisions? Because they have completely different economics. Training a large model is a one-time capital cost measured in millions of dollars and weeks of data-center GPU time — something almost no individual or small team does from scratch, which is exactly why open weights are so valuable: someone else already paid the training bill, and you get to reuse it. Inference, by contrast, is an ongoing operating cost you pay every time you use the model. So when people talk about the "cost of AI" in day-to-day use, they almost always mean inference cost, and optimizing it — by choosing the right model size, quant, and running route — is where you actually save money.

A useful mental model for inference speed: the prefill phase is compute-bound (it does a lot of parallel math to read your prompt), while the decode phase is typically memory-bandwidth-bound (each new token requires reading the model's weights and the growing KV cache from memory). This is why a card with high memory bandwidth generates tokens quickly, and why very large models feel slower to generate even when they fit — there's simply more data to move per token. It's also why batching many requests together, as serving engines do, improves total throughput: the expensive weight reads get shared across all the requests in the batch.
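
The bandwidth argument can be turned into an upper-bound estimate. The sketch below assumes decode is purely memory-bandwidth-bound: each decode step reads the weights once, plus each request's KV cache, and emits one token per request in the batch. It ignores compute limits and software overhead, so real speeds will be lower. The function name and the example figures are hypothetical.

```python
# Upper-bound decode throughput under a pure memory-bandwidth model.
# Example figures are hypothetical, not measurements of any real card.

def decode_tokens_per_second(bandwidth_bytes_per_s: float,
                             weight_bytes_read: float,
                             kv_bytes_per_request: float,
                             batch_size: int = 1) -> float:
    """Total tokens/s across the batch; weights are read once per step."""
    bytes_per_step = weight_bytes_read + batch_size * kv_bytes_per_request
    step_seconds = bytes_per_step / bandwidth_bytes_per_s
    return batch_size / step_seconds  # one token per request per step


if __name__ == "__main__":
    bandwidth = 1000e9  # 1000 GB/s
    weights = 8e9       # e.g. 8B parameters at 8 bits each
    kv = 1e9            # KV cache read per request per step
    for batch in (1, 8):
        tps = decode_tokens_per_second(bandwidth, weights, kv, batch)
        print(f"batch {batch}: about {tps:.0f} tokens/s total")
```

In this toy example, going from a batch of 1 to a batch of 8 raises total throughput from roughly 111 to 500 tokens per second, because the weight reads are shared while only the per-request KV reads grow.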

Everything Spanvero estimates — VRAM-to-run, run cost, tokens-per-second expectations — is about inference, because that's the recurring, real cost of actually using an open model. To see the honest, $0-markup inference cost for any model across all three routes for your own workload, open the calculator at /calculator/, browse models by what your hardware can run under /models/, or compare two models' running costs side by side under /compare/.

Tokens · VRAM · Local vs API vs renting a GPU · KV cache · Active vs total parameters · Quantization
