How many tokens per second is usable?

A rough guide: below ~5 tokens/sec feels sluggish, ~10-20 is comfortable for chat (around or above reading speed), and higher is nice but has diminishing returns for a single reader — what you actually get depends on the model, quant, and hardware.

Tokens per second (tok/s) measures generation speed: how fast a model emits its output, one token at a time. A token is roughly three-quarters of an English word, so you can translate tok/s into words per second and get a feel for the experience. That is the practical way to answer "how fast is fast enough," because the right number depends on what you're doing.

For interactive chat with a single user, the useful anchor is human reading speed. Most adults read somewhere around 3-5 words per second (roughly 200-300 words per minute), which is very roughly 4-7 tokens per second. Once a model generates at about reading speed or a bit above, the text appears about as fast as you can consume it and the experience feels smooth. As a rough guide: under about 5 tok/s feels sluggish and you'll notice yourself waiting; around 10-20 tok/s is comfortable for chat and coding; above that is pleasant but has diminishing returns for a single person reading the output, because the text already arrives faster than you can read it.
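The conversion and the rough bands above can be written down as a small sketch. The 0.75 words-per-token ratio and the 5, 10, and 20 tok/s thresholds come from the text; the label for the 5-10 tok/s gap ("usable") and the function names are assumptions made for illustration.

```python
WORDS_PER_TOKEN = 0.75  # rough English average from the text; varies by tokenizer and language


def words_per_minute(tok_per_sec: float) -> float:
    """Convert a generation speed in tokens/sec to approximate words/min."""
    return tok_per_sec * WORDS_PER_TOKEN * 60


def chat_feel(tok_per_sec: float) -> str:
    """Map tok/s onto the rough single-reader bands from the text."""
    if tok_per_sec < 5:
        return "sluggish"
    if tok_per_sec < 10:
        return "usable"  # assumed label: the text does not name this band
    if tok_per_sec <= 20:
        return "comfortable"
    return "fast (diminishing returns)"


for speed in (3, 8, 15, 40):
    wpm = words_per_minute(speed)
    print(f"{speed:>3} tok/s ~ {wpm:.0f} words/min: {chat_feel(speed)}")
```

Comparing the words-per-minute figure with your own reading pace is a quick way to tell whether a given tok/s would keep up with you.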

The context changes what "usable" means, though. For a back-and-forth chat, near-reading-speed is fine. For an agent or automated pipeline that generates long outputs you don't read in real time — batch summarization, code generation you'll review later, processing many documents — raw throughput matters more, and higher tok/s directly cuts how long the job takes. For serving many users at once, what matters is aggregate throughput across all requests, not the speed of any single stream, which is a different measurement handled by serving engines that batch requests. So "how many tok/s is usable" has different answers for a chatting human, an offline batch job, and a multi-user service.
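For the batch and multi-user cases, the arithmetic is simple enough to sketch. The job sizes and speeds below are made-up illustrations, not measurements, and the estimate ignores prompt processing time; summing per-stream speeds is also an idealization, since individual streams usually slow down as a serving engine batches more requests.

```python
def batch_hours(num_docs: int, output_tokens_per_doc: int, tok_per_sec: float) -> float:
    """Hours to generate all outputs at a steady tok/s (prefill time ignored)."""
    total_tokens = num_docs * output_tokens_per_doc
    return total_tokens / tok_per_sec / 3600


def aggregate_tok_per_sec(per_stream_speeds: list[float]) -> float:
    """Aggregate throughput of a server: the sum over all concurrent streams."""
    return sum(per_stream_speeds)


# Hypothetical job: 1,000 documents, 500 output tokens each.
for speed in (10, 50, 200):
    print(f"{speed:>3} tok/s -> {batch_hours(1000, 500, speed):.2f} hours")

# Hypothetical server: 16 users each seeing 12 tok/s.
print(aggregate_tok_per_sec([12.0] * 16), "tok/s aggregate")
```

The same per-stream speed that is merely adequate for one reader can add up to a large aggregate figure, which is why the two numbers should not be compared directly.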

There's also the first-token delay to consider, separate from the streaming speed. Generation happens in two phases: the model first reads your entire prompt (prefill), then generates the answer token by token (decode). A long prompt makes the prefill take longer, so the first token can lag even when the subsequent streaming is quick. If responses feel slow to start but then flow fine, a long input is usually why — not the tok/s figure itself.
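A minimal sketch of the two-phase timing, assuming prefill and decode each run at a steady rate. The prefill and decode speeds below are placeholder values chosen for illustration, not figures for any real model or machine, and the first-token estimate ignores the small cost of the first decode step.

```python
def response_latency(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tok_per_sec: float,
    decode_tok_per_sec: float,
) -> tuple[float, float]:
    """Return (seconds until first token, total seconds for the response)."""
    time_to_first_token = prompt_tokens / prefill_tok_per_sec
    decode_seconds = output_tokens / decode_tok_per_sec
    return time_to_first_token, time_to_first_token + decode_seconds


# Same decode speed, very different prompt lengths (all values hypothetical).
for prompt_tokens in (200, 20_000):
    ttft, total = response_latency(
        prompt_tokens, output_tokens=300,
        prefill_tok_per_sec=1000, decode_tok_per_sec=20,
    )
    print(f"prompt={prompt_tokens:>6} tokens: first token after {ttft:.1f}s, done after {total:.1f}s")
```

In this toy setup the streaming speed is identical in both runs; only the wait before the first token changes with prompt length.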

What determines the tok/s you actually get? Four things dominate: the model's active size (bigger and denser is slower per token — this is where the Mixture-of-Experts distinction helps, since an MoE model generates at its small active size even when it's large in total), the quantization level, your hardware's memory bandwidth and compute, and your context length (a longer context means a bigger KV cache to read each step). This is why the same model can feel snappy on one machine and sluggish on another, and why a smaller or more heavily quantized model is one lever to speed things up if generation feels slow.
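One common back-of-envelope way to see how these factors interact is to assume single-stream decoding is limited by memory bandwidth: each generated token requires reading the active weights plus the KV cache once. The sketch below applies that assumption. It gives an optimistic ceiling rather than a prediction, it ignores compute limits and overhead, and every hardware and model figure in it is hypothetical.

```python
def decode_ceiling_tok_per_sec(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_per_sec: float,
    kv_cache_gb: float = 0.0,
) -> float:
    """Rough upper bound on decode speed if memory bandwidth is the only limit."""
    weight_gb = active_params_billion * bits_per_weight / 8  # GB of weights read per token
    gb_read_per_token = weight_gb + kv_cache_gb
    return bandwidth_gb_per_sec / gb_read_per_token


BANDWIDTH = 400  # GB/s, hypothetical device

examples = {
    "8B dense, 4-bit": decode_ceiling_tok_per_sec(8, 4, BANDWIDTH),
    "8B dense, 8-bit": decode_ceiling_tok_per_sec(8, 8, BANDWIDTH),
    "70B dense, 4-bit": decode_ceiling_tok_per_sec(70, 4, BANDWIDTH),
    "MoE with 3B active, 4-bit": decode_ceiling_tok_per_sec(3, 4, BANDWIDTH),
    "8B dense, 4-bit, 1 GB KV cache": decode_ceiling_tok_per_sec(8, 4, BANDWIDTH, kv_cache_gb=1.0),
}
for name, ceiling in examples.items():
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Under this assumption, halving the bits per weight or the active parameter count roughly doubles the ceiling, and a growing KV cache lowers it, which matches the levers described above. Real speeds land below the ceiling.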

A note on the numbers: real tok/s figures vary so much across models, quants, and hardware that any single quoted number would be misleading. A small model on a fast GPU can produce hundreds of tokens per second; a large model split partly onto CPU can crawl at a few. Rather than trust a benchmark someone else ran on different hardware, the reliable approach is to try a model on your own machine. Many local inference tools report tok/s as they generate, so you can judge for yourself whether the speed is comfortable for your use.
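If your tool does not report speed, you can time a token stream yourself. The sketch below measures time to first token and decode tok/s for any iterable that yields tokens. `fake_stream` is a stand-in that only sleeps; it is not a real model API, and you would replace it with whatever streaming iterator your own runtime provides.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Time a token stream: first-token delay and decode speed after it."""
    start = time.perf_counter()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = time.perf_counter()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return {"tokens": 0, "ttft_s": None, "tok_per_sec": None}
    decode_seconds = last_at - first_at
    # Decode speed counts the tokens that arrived after the first one.
    speed = (count - 1) / decode_seconds if count > 1 and decode_seconds > 0 else None
    return {"tokens": count, "ttft_s": first_at - start, "tok_per_sec": speed}


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: sleeps instead of computing."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


result = measure_stream(fake_stream(20, prefill_s=0.05, per_token_s=0.01))
print(result)
```

Separating the first-token delay from the decode rate keeps a long prompt from making the streaming speed look worse than it is.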

If a model is too slow on your hardware, the practical fixes are: run a smaller model or a lower-bit quant, shorten your context, use a card with more memory bandwidth, or move to a rented GPU or a hosted API where the model runs on faster hardware. Which one makes sense depends on why you need the speed and how much you'll use it.

Because tok/s depends so heavily on your specific setup, Spanvero focuses on the objective drivers — model size, quant, and the VRAM/hardware you'd run on — and points you to test speed on your own machine rather than quoting benchmarks we didn't run. To reason about the trade-offs, use /calculator/ to see how model size and quant affect the memory and hardware you'd need, browse smaller and MoE models that generate faster at /best/best-small-llms/, and see live pricing and model data at /trends/.



© 2026 Cynosure LLC.
