Actually running a trained model to get outputs — generating text, an image, or a transcription — as opposed to training it.
Inference is the "use" phase of a model: you give it a prompt and it produces a response. It's distinct from training, which is the one-time, expensive process of creating the model's weights. When you chat with a local model, generate an image, or call an API, you're paying for inference.
Inference cost and speed depend on the model's active size, the quantization, the hardware, and how many tokens are involved. For text models, speed is usually reported in tokens per second, and there are two phases: "prefill" (reading your prompt) and "decode" (generating output one token at a time).
Everything Spanvero estimates — VRAM-to-run, run-cost, tokens/sec expectations — is about inference, because that's the recurring cost of actually using an open model.
Tokens · VRAM · Local vs API vs renting a GPU · KV cache
All explainers → · Browse models →
Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.