
Embeddings

Numeric vectors that represent the meaning of text (or images), so that similar content sits close together — the backbone of semantic search and Retrieval-Augmented Generation (RAG).

Embeddings are one of the most useful and widely deployed ideas in modern AI, and they solve a different problem from the chat models people usually think of. An embedding model takes a piece of text (a sentence, a paragraph, a document chunk) and turns it into a fixed-length list of numbers called a vector, often a few hundred to a couple of thousand numbers long. The key property is that this vector captures meaning: texts about similar topics produce vectors that are close together in the vector space, while unrelated texts produce vectors that are far apart. That lets a computer measure how similar two pieces of text are by comparing their vectors mathematically, rather than by matching exact keywords.

This is a big deal because it enables semantic search — finding things by meaning instead of by literal word matching. A keyword search for "how to reset my password" misses a document titled "account recovery steps" because the words don't overlap; a semantic search using embeddings finds it, because the meanings are close. The same mechanism powers recommendations ("items similar to this one"), clustering and deduplication (grouping similar content), and classification.
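The password example can be sketched in a few lines of Python. The three-number vectors below are hand-written stand-ins, not output from a real embedding model, and Euclidean distance (`math.dist`) is used here only as a simple measure of closeness.

```python
import math

def keyword_matches(query, titles):
    """Return the titles that share at least one literal word with the query."""
    query_words = set(query.lower().split())
    return [t for t in titles if query_words & set(t.lower().split())]

def semantic_best_match(query_vector, doc_vectors):
    """Return the title whose vector lies closest to the query vector."""
    return min(doc_vectors, key=lambda t: math.dist(query_vector, doc_vectors[t]))

# Assumption: toy vectors written by hand for illustration.
# A real embedding model would produce these, with far more dimensions.
doc_vectors = {
    "account recovery steps": [0.9, 0.1, 0.0],
    "quarterly sales report": [0.0, 0.2, 0.9],
}
query = "how to reset my password"
query_vector = [0.8, 0.3, 0.1]

keyword_result = keyword_matches(query, doc_vectors)             # no shared words
semantic_result = semantic_best_match(query_vector, doc_vectors)  # nearest vector
print(keyword_result, semantic_result)
```

The keyword lookup comes back empty because no words overlap, while the nearest-vector lookup still picks the account recovery document.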

The single most important application today is Retrieval-Augmented Generation, or RAG, which is the standard way to make a chat model answer questions about your own documents. The recipe has two phases. First, split your documents into chunks, run each chunk through an embedding model, and store the resulting vectors in a vector database. Then, when a user asks a question, embed the question with the same model, search the database for the chunks whose vectors are closest to it, and paste those chunks into the chat model's context window before it answers. The model then responds grounded in your actual documents. RAG is why embeddings show up constantly in real applications: it is usually a cheaper and better-suited alternative to fine-tuning when your goal is to inject up-to-date facts rather than teach a new behavior or style (see the fine-tuning explainer for that distinction).
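The recipe above can be sketched end to end without any external service. Everything here is an illustrative assumption: `toy_embed` is a hashed bag-of-words stand-in that only captures word overlap (a real embedding model captures meaning), the "vector database" is a plain list, and the final prompt would be sent to whatever chat model you use.

```python
import math
import zlib

DIM = 1024  # toy vector length; real models fix their own dimension

def toy_embed(text):
    """Stand-in for an embedding model: hashed word counts, scaled to unit length."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,?!")
        if word:
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def chunk_words(text, max_words=12):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(documents):
    """Phase 1: chunk each document, embed each chunk, store (vector, chunk)."""
    return [(toy_embed(c), c) for doc in documents for c in chunk_words(doc)]

def retrieve(index, question, k=2):
    """Phase 2: embed the question the same way and return the k closest chunks."""
    question_vector = toy_embed(question)
    ranked = sorted(index, key=lambda item: dot(question_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, chunks):
    """Paste the retrieved chunks into the text handed to the chat model."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "To reset your password open the account page and choose reset password.",
    "Invoices are emailed on the first day of each month.",
]
question = "How do I reset my password?"
index = build_index(documents)
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping `toy_embed` for a call to a real embedding model, and the list for a vector database, keeps the same shape: index once, then embed, retrieve, and prompt per question.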

A practical thing to know: embedding is a different job from text generation, and embedding models are typically much smaller, cheaper, and faster than chat LLMs. A generative model outputs sentences; an embedding model outputs vectors and nothing else, so you never "chat" with it. Because they are small, many embedding models run comfortably on modest hardware, even CPU-only, so the embedding step in a RAG pipeline is rarely the expensive part. When you compare embedding models, you care about the vector dimension, the context length they accept per chunk, the languages they cover, and their retrieval accuracy on benchmarks like MTEB, not about conversational quality.

A few practical details help when choosing and using an embedding model. The vector's dimension (how many numbers it has) affects both quality and storage: higher-dimensional vectors can capture more nuance but cost more to store and compare across a large corpus. Each model also has a maximum input length, which sets how big your document chunks can be; if a chunk exceeds it, the extra text is typically truncated (some APIs reject the request instead), so chunking strategy matters for retrieval quality. And embeddings are model-specific: a vector produced by one embedding model is only comparable to other vectors from the same model, so you must embed your documents and your queries with the same model, and re-embed everything if you switch models later.
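Two of these details reduce to simple arithmetic, sketched below under stated assumptions: the "tokens" are placeholder strings rather than real model tokens, the overlap between chunks is an optional illustrative choice, and the storage figure counts only raw 4-byte floats, ignoring any index overhead or metadata a vector database adds.

```python
def split_into_chunks(tokens, max_len, overlap=0):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so no window exceeds the
    model's input limit and nothing is silently cut off.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def raw_vector_storage_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming 4-byte floats by default."""
    return num_vectors * dimension * bytes_per_value

tokens = [f"t{i}" for i in range(10)]            # placeholder tokens
chunks = split_into_chunks(tokens, max_len=4, overlap=1)
print([len(c) for c in chunks])

small = raw_vector_storage_bytes(1_000_000, 768)   # hypothetical 768-dim model
large = raw_vector_storage_bytes(1_000_000, 1536)  # hypothetical 1536-dim model
print(small / 1e9, large / 1e9)                    # gigabytes
```

Doubling the dimension doubles the raw storage for the same number of chunks, which is the trade-off to weigh against any gain in retrieval quality.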

Similarity between vectors is usually measured with cosine similarity, the cosine of the angle between two vectors, which ranges from -1 to 1; a higher score means closer meaning. You don't need the math to use embeddings, but it explains what a vector database is doing under the hood: storing your document vectors and, for each query, quickly finding the ones with the highest similarity so they can be handed to the chat model. The quality of a RAG system depends heavily on this retrieval step getting the right chunks, which is why the embedding model choice and the chunking strategy matter as much as the chat model that writes the final answer.
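For readers who do want the math, here is a minimal sketch of cosine similarity and of the lookup a vector database performs. The two-dimensional vectors are made up for illustration, and the search is brute force; production vector databases generally use approximate indexes to avoid scoring every stored vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between two vectors, recovered from their cosine similarity."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.degrees(math.acos(cos))

def top_k(query, stored, k):
    """Brute-force search: score every stored vector, keep the k most similar."""
    return heapq.nlargest(k, stored, key=lambda name: cosine_similarity(query, stored[name]))

# Assumption: hand-picked 2-D vectors so the angles are easy to see.
stored = {
    "a": [1.0, 0.0],   # same direction as the query
    "b": [1.0, 1.0],   # 45 degrees away
    "c": [0.0, 1.0],   # 90 degrees away
    "d": [-1.0, 0.0],  # opposite direction
}
query = [2.0, 0.0]

best_two = top_k(query, stored, 2)
print(best_two, angle_degrees(query, stored["b"]))
```

The query is twice as long as vector `a` yet scores a perfect match with it, because cosine similarity depends only on direction and not on length.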

Everything is still counted in tokens here too: embedding a large corpus costs tokens (and, on hosted APIs, money) proportional to how much text you push through, so token math from the tokens explainer applies directly to estimating an embedding job's cost. In Spanvero's catalog, embedding models are tagged distinctly so you can find them apart from chat models — you can browse the catalog at /models/ and use the calculator at /calculator/ to compare the honest, $0-markup cost of running an embedding model locally versus via an API key, which matters when you're embedding a large document set.
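As a rough sketch of that token math: the corpus size, the tokens-per-word ratio of 1.3, and the price of $0.10 per million tokens below are all placeholder assumptions, not figures from Spanvero's catalog or any provider's price list. Substitute your own measurements and the current price of the model you are considering.

```python
def estimate_tokens(num_documents, avg_words_per_document, tokens_per_word=1.3):
    """Rough token count for a corpus; the tokens-per-word ratio is an assumption."""
    return round(num_documents * avg_words_per_document * tokens_per_word)

def estimate_api_cost(total_tokens, price_per_million_tokens):
    """Cost of embedding total_tokens at a given price per million tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical corpus and hypothetical price, for illustration only.
total_tokens = estimate_tokens(num_documents=50_000, avg_words_per_document=800)
cost = estimate_api_cost(total_tokens, price_per_million_tokens=0.10)
print(f"{total_tokens:,} tokens, about ${cost:.2f}")
```

Because cost scales linearly with tokens, re-embedding after a model switch costs the same again, which is worth including in the estimate.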


