Context window

The maximum number of tokens (prompt plus generated output) a model can consider at once; anything beyond it is cut off or forgotten.

The context window is a model's working memory for a single request, measured in tokens. It's the total amount of text the model can "see" at one time, and it has to hold everything at once: the system prompt that sets the model's behavior, the conversation history or documents you provide, and the response the model is generating. When people ask why a chatbot "forgets" what they said earlier in a long conversation, the context window is the answer: the earliest messages have scrolled out of the window and the model can no longer see them.

Context sizes have grown enormously. Early open models had 2K or 4K token windows. Today it's common to see 8K, 32K, and 128K, and some newer models advertise 200K or even 1M tokens. To put that in perspective using the token rule of thumb (~0.75 words per token), a 128K window is roughly 96,000 words — a good-sized book. That's what makes long-context models attractive: you can drop an entire codebase, a long legal document, or a lengthy chat history into a single request and have the model reason over all of it.
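The arithmetic behind that estimate is easy to reproduce. The snippet below is a minimal sketch that applies the ~0.75 words-per-token rule of thumb to a few window sizes; the helper names are illustrative, and the real ratio varies by tokenizer and language.

```python
# Rule of thumb from the text; actual ratios depend on the tokenizer and language.
WORDS_PER_TOKEN = 0.75


def approx_words(tokens: int) -> int:
    """Rough word count that fits in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def approx_tokens(words: int) -> int:
    """Rough token count needed for a given number of words."""
    return round(words / WORDS_PER_TOKEN)


if __name__ == "__main__":
    # Window sizes are taken as round thousands for simplicity.
    for label, window in [("4K", 4_000), ("32K", 32_000),
                          ("128K", 128_000), ("1M", 1_000_000)]:
        print(f"{label:>4} window ~ {approx_words(window):,} words")
```

Treat the output as a ballpark only: code, non-English text, and unusual formatting often use noticeably more tokens per word.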

But context is not free, and this is the part people miss. The KV cache — the memory that stores the attention keys and values for every token in context — grows in direct proportion to how many tokens are in the window. Feed a model a huge context and the KV cache can balloon to consume as much VRAM as the model's weights, or more. This is why long-context serving is memory-hungry, and why a model that fits comfortably on your GPU at a 4K context might run out of memory at 128K. If you're running locally, the context length you actually use is a real lever on your VRAM budget: shortening it frees up memory, and extending it costs memory. On hosted APIs, longer contexts also cost more, because you're paying per token for everything in the window.
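To make the proportionality concrete, here is a back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions: a standard transformer that stores one key and one value vector per token, per layer, per KV head, at 16-bit precision. The model shape below (32 layers, 8 KV heads, head dimension 128) is hypothetical and not taken from any particular model, and real serving stacks may add overhead or use cache quantization.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    for every token in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

if __name__ == "__main__":
    for ctx in (4_096, 32_768, 131_072):
        size = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
        print(f"{ctx:>7} tokens -> {gib(size):5.1f} GiB of KV cache")
```

With these assumed dimensions the cache costs 128 KiB per token, so it grows from 0.5 GiB at 4K tokens to 16 GiB at 128K tokens. The growth is strictly linear: doubling the context doubles the cache.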

There's also a quality caveat worth stating honestly. A large advertised context window is the maximum the model can technically accept; it does not guarantee the model uses all of it well. Many models get worse at recalling information buried in the middle of a very long context (the so-called "lost in the middle" effect). So a bigger number is better, all else equal, but it's a capacity spec, not a promise of perfect long-range recall.

It's also useful to know that some models reach their headline context length through extension techniques rather than being trained natively at that length. Methods like RoPE scaling stretch a model trained at, say, 8K tokens to accept 32K or more. These work, but a model's recall over the extended range is often weaker than over the range it was actually trained on — another reason to treat a very large advertised window as a ceiling, not a guarantee. When long-context reliability really matters for your task, it's worth testing the model on your own long inputs rather than trusting the spec sheet alone.
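As an illustration of how such stretching can work, the sketch below shows linear RoPE scaling (often called position interpolation), which is only one of several scaling variants. It assumes the standard rotary formulation, where position p rotates dimension pair i by the angle p / base^(2i/d). The 8K and 32K lengths come from the example above; the function name, head dimension, and base value are illustrative assumptions.

```python
def rope_angles(position: int, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotation angles for one token position under rotary embeddings.

    With scale > 1 the position is divided by the scale factor before the
    angles are computed (linear scaling), which squeezes a longer sequence
    into the position range the model saw during training.
    """
    pos = position / scale
    return [pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


TRAINED_LEN = 8_192    # length the model was trained at (example from the text)
TARGET_LEN = 32_768    # extended length we want to accept
SCALE = TARGET_LEN / TRAINED_LEN  # 4.0

if __name__ == "__main__":
    # Position 32,764 in the extended window is mapped onto trained position 8,191.
    extended = rope_angles(32_764, head_dim=64, scale=SCALE)
    native = rope_angles(8_191, head_dim=64)
    print(extended == native)
```

The trade-off is visible in the mapping itself: four neighboring positions in the extended window now share the spacing that one position had during training, which is one plausible reason recall over the stretched range tends to be weaker without further fine-tuning.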

When you do run into the limit, there are standard ways to cope. For long chats, applications summarize or drop older turns to stay within the window. For big documents that exceed even a large window, the common pattern is retrieval — splitting the text into chunks, storing them as embeddings, and pulling only the relevant pieces into context for each question (see the embeddings explainer). That way you can work with material far larger than any single context window, feeding the model only what's relevant at a time, which is also cheaper than stuffing everything in.
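Both coping strategies can be sketched in a few lines. The code below is a toy illustration, not a production design: `count_tokens` is a crude stand-in for a real tokenizer, and `embed` uses word counts in place of a real embedding model. All function names here are hypothetical.

```python
import math
from collections import Counter


def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer, using the ~0.75 words-per-token rule of thumb.
    return math.ceil(len(text.split()) / 0.75)


def trim_history(system: str, turns: list[str],
                 window: int, reserve_output: int) -> list[str]:
    """Keep the newest turns that fit alongside the system prompt and reserved output."""
    budget = window - reserve_output - count_tokens(system)
    kept = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                         # everything older is dropped
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))           # restore chronological order


def chunk(text: str, size: int) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    turns = ["one two three", "four five six", "seven eight nine"]
    print(trim_history("be brief", turns, window=20, reserve_output=8))

    docs = ["the kv cache grows with context",
            "pricing is per token",
            "rope scaling extends context"]
    print(retrieve("how does the kv cache grow", docs, k=1))
```

Only the chunks returned by `retrieve` (plus the question) need to go into the prompt, so the request stays small no matter how large the source material is. Summarizing old turns instead of dropping them follows the same budgeting logic, with a model call replacing the `break`.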

Because context window is an objective catalog fact, Spanvero treats it as a filter and a spec, never a quality claim. You can sort and filter models by it — for example, browse the long-context picks at /best/best-long-context-llms/ for models with 128K+ windows. And because context length directly drives memory use, the calculator at /calculator/ lets you set the context you plan to use and see how it changes the VRAM needed and the honest cost to run — locally, on a rented GPU, or via your own API key. If you're comparing two models where context matters, the head-to-head pages under /compare/ show each one's window side by side.

Tokens · KV cache · VRAM · Inference · Embeddings · Parameters (the "B" / billions)
