A high-throughput, GPU-focused serving engine for LLMs, designed to serve many concurrent requests efficiently using PagedAttention and continuous batching.
vLLM is an open-source inference and serving engine built for performance and scale rather than laptop convenience. It's what you run when you want to host a model for an app or many users on a GPU, typically loading safetensors weights and exposing an OpenAI-compatible API.
Its two signature techniques are PagedAttention, which stores the KV cache in non-contiguous blocks (like operating-system virtual memory) to slash wasted memory, and continuous batching, which swaps finished requests out and new ones in every step so the GPU never sits idle. Together these let it serve far more traffic than a naive loop on the same hardware.
vLLM is the natural choice for the "rent a GPU and host it yourself" path: pair it with a rented GPU to serve a model at scale, often far cheaper per token than a managed API once volume is high.
KV cache · Safetensors · Local vs API vs renting a GPU · Inference
All explainers → · Browse models →
Open the free Spanvero advisor → · Honest, $0-markup. © 2026 Cynosure LLC.