
Local vs API vs renting a GPU

The three ways to actually run an open model: on your own hardware (local, $0 compute), through a hosted pay-per-token API, or by renting a cloud GPU and serving it yourself — each cheapest in different situations.

Once you've picked an open model, there are three ways to run it, and the choice can be the difference between spending nothing and overpaying by a wide margin. There is no single "cheapest" answer: it depends on your usage pattern, your hardware, your privacy needs, and your scale. Understanding the trade-offs lets you pick the best option for your situation, and it is the core question Spanvero exists to answer.

Running it locally means running the model on hardware you already own, using a tool like Ollama, LM Studio, or llama.cpp directly. The compute cost is essentially $0, since you pay only for electricity, and there are two other big advantages: your data never leaves your machine (real privacy), and it works fully offline. The limits are your own hardware: the model has to fit in your VRAM (or run partly on slower system RAM), so large models may not fit at all or may generate slowly, and you're responsible for setup. Local is the clear winner for personal use, development, privacy-sensitive work, and any model small enough for your card. Because the marginal cost is close to zero (electricity only), even high personal volume costs almost nothing.
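As a rough illustration of the "does it fit in VRAM" check, here is a minimal sketch. The bytes-per-parameter figures are common rules of thumb, the 1.2 overhead multiplier is an assumption standing in for the KV cache and activations, and the function names are hypothetical. Real quantized formats often use somewhat more than 0.5 bytes per parameter, and long contexts need more headroom, so treat the result as a first-pass estimate only.

```python
# Approximate bytes per parameter for common weight formats (rules of thumb).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billions: float, precision: str = "fp16",
                     overhead: float = 1.2) -> float:
    """Estimate VRAM (GB) for the weights plus an assumed runtime overhead.

    `overhead` is an illustrative multiplier for the KV cache and
    activations; the real figure grows with context length and batch size.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * overhead


def fits_locally(params_billions: float, vram_gb: float,
                 precision: str = "fp16") -> bool:
    """True if the estimated footprint fits entirely in the given VRAM."""
    return estimate_vram_gb(params_billions, precision) <= vram_gb


if __name__ == "__main__":
    for size, precision in [(7, "fp16"), (7, "int4"), (70, "int4")]:
        need = estimate_vram_gb(size, precision)
        verdict = "fits" if fits_locally(size, 24, precision) else "does not fit"
        print(f"{size}B @ {precision}: ~{need:.1f} GB, {verdict} in 24 GB")
```

Under these assumptions a 7B model fits a 24 GB card at either precision, while a 70B model does not fit even at 4-bit and would have to spill into system RAM or move to one of the other two routes.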

Using a hosted API means calling a provider that runs the model for you and charges per token. Prices are quoted per million tokens, usually with a lower rate for input and a higher rate for output (see the tokens explainer). The upsides: you need no hardware, you can access very large models you couldn't run yourself, and it scales instantly with no operational burden. The downsides: cost scales directly with usage (light and spiky usage is cheap; heavy sustained usage gets expensive), and your data leaves your machine for the provider's servers. An API is usually the cheapest and simplest choice when your volume is low, bursty, or unpredictable, or when you need a model far too big for your own hardware. A key honesty point: the fair way to use an API through Spanvero is with your own API key, so you pay the provider's real rate directly with no reseller markup.
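The per-million-token pricing described above reduces to simple arithmetic. The sketch below assumes separate input and output rates; the function name and the example prices are hypothetical and do not reflect any particular provider's rates.

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a workload billed per million tokens, input and output priced separately."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical month: 50M input tokens, 10M output tokens,
    # at illustrative rates of $0.20 (input) and $0.60 (output) per million.
    monthly = api_cost_usd(50_000_000, 10_000_000, 0.20, 0.60)
    print(f"Estimated monthly API cost: ${monthly:.2f}")
```

With these illustrative numbers the month costs about $16. Doubling the token volume doubles the bill, which is the "cost scales directly with usage" behavior noted above.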

Renting a cloud GPU means paying by the hour for a GPU in the cloud and serving the model on it yourself, typically with a high-throughput engine like vLLM. This gives you big-model performance without buying expensive hardware, and at sustained high volume it is frequently the cheapest option of all, because you pay a flat hourly rate rather than a per-token price. The catch is that you pay for the GPU whether it is busy or idle, so it only wins when you keep it well utilized, and you own the setup and operations. Renting shines for steady, high-throughput workloads and for large models you want full control over.
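To see why utilization matters so much, a flat hourly rate can be converted into an effective price per million tokens. This is a sketch under stated assumptions: the hourly rate and the aggregate throughput figure are hypothetical, and real throughput depends on the model, the GPU, batch size, and context length.

```python
def gpu_cost_per_million_tokens(hourly_rate_usd: float,
                                tokens_per_second: float,
                                utilization: float) -> float:
    """Effective $ per million tokens for a GPU billed at a flat hourly rate.

    `tokens_per_second` is the aggregate throughput when the GPU is busy;
    `utilization` is the fraction of paid hours it actually spends serving.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    tokens_per_paid_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_paid_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical: $2.00/hour GPU sustaining 1,000 tokens/second when busy.
    for utilization in (1.0, 0.5, 0.1):
        cost = gpu_cost_per_million_tokens(2.00, 1000, utilization)
        print(f"{utilization:>4.0%} utilized: ${cost:.2f} per million tokens")
```

In this example the same GPU costs roughly $0.56 per million tokens when fully busy and about ten times that at 10% utilization, because the idle hours are billed all the same.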

The practical decision, roughly: run it locally if it fits your hardware and you value privacy or have no budget for ongoing costs; use a pay-per-token API (with your own key) for light, spiky, or occasional use, or for models too big to self-host; rent a GPU and self-serve with vLLM when your volume is high and steady enough that a flat hourly rate beats per-token pricing. The break-even between the API and the rented-GPU paths is a concrete number that depends on your token volume and how busy you'd keep the GPU.
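The decision rule above can be sketched as a break-even calculation. This is a simplified illustration, not Spanvero's actual method: it assumes a single blended API price per million tokens, a GPU rented around the clock (730 hours is an average month), and it ignores setup effort, privacy requirements, and whether one GPU can physically serve the volume. All names and prices are hypothetical.

```python
HOURS_PER_MONTH = 730.0  # average month: 8,760 hours per year / 12


def break_even_tokens_per_month(gpu_hourly_usd: float,
                                api_price_per_m: float) -> float:
    """Monthly token volume at which an always-on rented GPU costs the same as the API."""
    gpu_monthly_usd = gpu_hourly_usd * HOURS_PER_MONTH
    return gpu_monthly_usd / api_price_per_m * 1_000_000


def cheapest_route(fits_local: bool, monthly_tokens: float,
                   gpu_hourly_usd: float, api_price_per_m: float) -> str:
    """Pick a route on cost alone: local if it fits, else API below break-even, else rent."""
    if fits_local:
        return "local"
    if monthly_tokens < break_even_tokens_per_month(gpu_hourly_usd, api_price_per_m):
        return "api"
    return "rent-gpu"


if __name__ == "__main__":
    # Hypothetical: $2.00/hour GPU versus a blended API price of $0.50 per million tokens.
    threshold = break_even_tokens_per_month(2.00, 0.50)
    print(f"Break-even: {threshold / 1e9:.2f} billion tokens per month")
    print(cheapest_route(False, 200_000_000, 2.00, 0.50))
    print(cheapest_route(False, 5_000_000_000, 2.00, 0.50))
```

With these illustrative prices the break-even sits near 2.9 billion tokens per month, which is about 1,100 tokens per second sustained around the clock. Below that volume the API is cheaper in this model; above it, renting wins, provided the GPU can actually deliver that throughput.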

Making this trade-off transparent is the reason Spanvero exists. For any open model, we compute and compare the real cost of all three routes with zero markup: local ($0 compute, given the VRAM you'd need), rent-your-own-GPU at the vendor's direct hourly price, and bring-your-own-key API at the provider's real per-token rate. That lets you pick the cheapest route for your situation rather than guessing. Enter your own workload at /calculator/, browse models by what your hardware can run at /models/24gb-vram/ and its siblings, or compare two models' costs across all three routes side by side under /compare/.

