Spanvero How it works Find a model Compare models Pricing

vLLM

A high-throughput, GPU-focused serving engine for LLMs, designed to serve many concurrent requests efficiently using PagedAttention and continuous batching.

vLLM is the tool you reach for when you stop asking "how do I run a model on my laptop?" and start asking "how do I serve a model to an app or many users, efficiently, on a GPU?" It's an open-source inference and serving engine built for performance and scale rather than single-user convenience — the production counterpart to friendly local tools like Ollama and LM Studio. If those tools are for running a model for yourself, vLLM is for hosting a model for others.

The key thing that sets vLLM apart is how much traffic it can squeeze out of the same hardware, and that comes from two signature techniques. The first is PagedAttention, its approach to managing the KV cache — the per-request memory that stores attention keys and values and that grows with context length (see the KV cache explainer). Naively, a server reserves one big contiguous block of memory per request, which wastes a lot when requests are of different, unpredictable lengths. PagedAttention instead stores the KV cache in small, non-contiguous blocks, exactly like an operating system's virtual-memory pages. This nearly eliminates the wasted memory, which means more requests fit on the same GPU at once. The second technique is continuous batching: instead of processing a fixed batch of requests to completion before starting the next, vLLM swaps finished requests out and new ones in at every generation step, so the GPU never sits idle waiting for the slowest request in a batch. Together these let vLLM serve far more concurrent traffic — and far more tokens per second in aggregate — than a naive request-by-request loop on identical hardware.

Operationally, vLLM is GPU-focused and loads models from their full-precision safetensors weights (rather than the quantized GGUF format that local tools use, though vLLM supports several quantization schemes of its own). It exposes an OpenAI-compatible API server, so applications written against the common API shape can point at your vLLM instance with minimal changes. It supports tensor parallelism to split a large model across multiple GPUs when one card isn't enough for the model's total size — which matters especially for big dense models and large MoE models where the total parameter count drives memory needs (see active vs total parameters).

When does vLLM make financial sense? It's the natural engine for the "rent a GPU and host the model yourself" path. Pair vLLM with a rented cloud GPU (or your own hardware) and you can serve an open model at scale, and once your volume is high and steady, the cost per token is often dramatically lower than paying a managed API's per-token rate. The trade-off is that you're paying for the GPU whether it's busy or idle, and you own the operational setup. That's the classic high-volume calculus laid out in the local vs API vs renting a GPU explainer: for light or spiky usage, a pay-per-token API is usually cheaper and simpler; for sustained heavy traffic, self-serving with vLLM on a rented GPU tends to win.

A fair question is when vLLM is overkill. For a single person chatting with a model, or a developer prototyping locally, vLLM's throughput advantages don't help — you have one request at a time, so the batching machinery has nothing to batch, and the simpler experience of Ollama or LM Studio is a better fit. vLLM earns its keep specifically under concurrency: multiple users or requests hitting the same model at once, where PagedAttention and continuous batching let a single GPU serve many of them efficiently. The rough rule is that vLLM is for serving, and local tools are for using. If you're building an app or API that others will call, vLLM (or a similar production engine) is the right layer; if you're the only user, you don't need it.

Because it's the standard engine for self-hosted serving, vLLM also anchors the economics of running your own model at scale. Its efficiency is what makes the rent-a-GPU path competitive against hosted APIs: the more requests you can pack onto one rented GPU, the lower your effective cost per token, which is what shifts the break-even in favor of self-hosting once your volume is high and steady. That's precisely the calculation to run before committing — a lightly-used rented GPU is expensive per token, while a well-utilized one can be very cheap.

This rent-and-serve route is one of the three ways Spanvero prices for every model, with $0 markup — we show the honest cost of a rented GPU running your model against the local ($0 compute) and your-own-API-key options, so you can find the genuine break-even point for your volume. To compare all three for a specific model and workload, open the calculator at /calculator/, or see them side by side on any model's page under /models/.

KV cache · Safetensors · Local vs API vs renting a GPU · Inference · Ollama · llama.cpp

All explainers → · Browse models →

The weekly price index

A short email of real AI price moves, straight from the daily log — no hype. We're collecting the list now; the first issue goes out when it opens. Unsubscribe with one click.

Joining the list needs JavaScript — or just email support@spanvero.com and we'll add you.

vLLM

Related

The weekly price index