
Text-to-image

Generating an image from a written description (a prompt); today this is almost always done with diffusion models.

Text-to-image is exactly what it sounds like: you type a description in words and a model produces a picture that matches it. Type "a watercolor fox in a snowy forest at dusk" and the model renders that scene from scratch. It's one of the most visible and popular applications of AI, and thanks to open models you can run it entirely on your own hardware. Understanding how it works and what drives its cost helps you choose a model and set realistic expectations.

The leading approach — behind essentially every open text-to-image model you can run yourself, including Stable Diffusion, SDXL, and FLUX — is the diffusion model. In brief, a diffusion model starts from pure random noise and iteratively "denoises" it over many steps into a coherent image, with the process steered at each step by an interpretation of your prompt. The full mechanics are in the diffusion model explainer; text-to-image is the specific task of applying that machinery to turn words into pictures. (Diffusion also powers image-to-image editing, inpainting, and increasingly video.)
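As a rough illustration of that loop, here is a toy sketch in Python. It is not a real diffusion sampler: the function name `toy_denoise` is hypothetical, and the "noise prediction" is computed from a known target instead of by a neural network conditioned on a prompt. It only shows the shape of the process: seeded noise in, a fixed number of refinement steps, a clean result out.

```python
import random


def toy_denoise(target, steps, seed):
    """Toy stand-in for a diffusion sampler (illustrative only).

    Returns the final values and the mean absolute error after each step.
    """
    rng = random.Random(seed)  # same seed -> same starting noise
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for step in range(steps):
        remaining = steps - step
        # ASSUMPTION: a real model predicts this noise with a neural network
        # steered by the prompt; here we cheat and derive it from a known target.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a fraction of the predicted noise; the last step removes the rest.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
        errors.append(sum(abs(xi - ti) for xi, ti in zip(x, target)) / len(target))
    return x, errors


if __name__ == "__main__":
    image, errors = toy_denoise(target=[0.2, 0.8, 0.5, 0.1], steps=20, seed=42)
    print([round(e, 4) for e in errors])
```

Calling the function twice with the same seed starts from the same noise, so both the result and the error history are identical from run to run.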

Several levers shape your results, and they're worth knowing because they also affect cost and speed. The prompt itself is the biggest one: descriptive, specific prompts generally produce better images, and good prompting is a skill. A negative prompt lets you list things to keep out ("blurry, extra fingers, text"). The number of steps controls refinement versus speed: more steps usually mean a cleaner image but a longer wait, with diminishing returns past each model's sweet spot. The guidance scale controls how tightly the image follows your prompt: higher values stick closer to the prompt, while lower values give the model more freedom and often a more natural-looking result. And fixing the random seed makes a run reproducible, so you can lock in a composition and then tweak other settings. Output resolution matters too: higher resolutions carry more detail but demand more VRAM and time per image.
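To make the guidance scale concrete: many diffusion pipelines implement it with classifier-free guidance, where the model predicts noise twice per step, once without the prompt and once with it, and the scale extrapolates from the first prediction toward the second. The sketch below assumes that technique; `apply_guidance` is a hypothetical helper name, and plain lists stand in for the model's noise tensors. Not every model uses exactly this form.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine two noise predictions using classifier-free guidance.

    noise_uncond: prediction made without the prompt
    noise_cond:   prediction made with the prompt
    """
    # Start at the unconditioned prediction and move toward (or past)
    # the prompt-conditioned one, scaled by guidance_scale.
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]  # illustrative values, not real model output
    cond = [1.0, 1.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, apply_guidance(uncond, cond, scale))
```

A scale of 1.0 returns the prompt-conditioned prediction unchanged, and larger values push past it. In pipelines built this way, a negative prompt typically takes the place of the empty prompt in the unconditioned branch, so each step is steered away from it.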

Steps and resolution are the crux of the cost picture. Text-to-image models are a different modality from text LLMs (they produce pixels, not tokens), so their run cost is measured per image rather than per token or per second. The two things that drive it are how much VRAM each generation needs (a function of the model size and your output resolution) and how long each image takes (a function of steps and hardware). Many capable open image models run comfortably on a mid-range consumer GPU, though the largest and highest-resolution ones want more VRAM. Licensing is also worth checking before commercial use: some open image models are permissively licensed while others (certain FLUX variants, for instance) are non-commercial. Customizing the look with LoRA adapters (small style/subject add-ons) is extremely common in this space.
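The per-image arithmetic can be sketched in a few lines. Everything here is an illustrative assumption: the function names are hypothetical, the example numbers are made up rather than measured, and the VRAM figure covers model weights only (activations, which grow with resolution, and any text encoder or image decoder come on top).

```python
def cost_per_image(seconds_per_image, gpu_price_per_hour):
    """Rental cost of one image: time on the GPU times its hourly price."""
    return seconds_per_image * gpu_price_per_hour / 3600.0


def weights_vram_gb(params_billions, bytes_per_param=2.0):
    """Rough VRAM for the weights alone, in GB (1 GB = 1e9 bytes).

    2 bytes per parameter corresponds to 16-bit weights.
    """
    return params_billions * bytes_per_param


if __name__ == "__main__":
    # Made-up inputs: 6 seconds per image on a GPU rented at $0.60 per hour.
    print(f"${cost_per_image(6, 0.60):.4f} per image")
    # Made-up model size: 12 billion parameters stored at 16-bit precision.
    print(f"{weights_vram_gb(12):.0f} GB for weights")
```

Halving the step count roughly halves `seconds_per_image`, which is why steps show up directly in the per-image cost.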

The open text-to-image landscape has a few recognizable families worth knowing as reference points. The Stable Diffusion line (including SDXL) is the long-standing open workhorse with an enormous ecosystem of community fine-tunes and LoRAs. The FLUX family raised the quality bar for open models, with variants under different licenses — some permissive, some non-commercial — so the license check matters here in particular. There are also newer entrants like Qwen-Image. You don't need to memorize the roster; the point is that "open text-to-image" isn't one model but a field of them at different quality levels, hardware needs, and license terms, and choosing among them is a fit decision, not a single right answer.

Running text-to-image yourself typically means an ecosystem like ComfyUI (a flexible node-based pipeline) or a Diffusers-based workflow, rather than the GGUF tools used for chat models — the same toolchain note covered in the diffusion model explainer. The practical planning questions stay simple, though: how much VRAM does a single generation at your target resolution need, how many seconds per image on your hardware, and does the license permit your intended use. Those three answers determine whether local generation is comfortable or whether renting a GPU is the better call, especially for higher resolutions or batch generation.
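Those three planning questions can be written down as a small decision helper. This is a minimal sketch with hypothetical names and thresholds (`GenerationPlan`, `recommend`, and the fields are not part of any tool mentioned here); the inputs are numbers you would measure or look up yourself.

```python
from dataclasses import dataclass


@dataclass
class GenerationPlan:
    vram_needed_gb: float         # for one image at your target resolution
    vram_available_gb: float      # on your own GPU
    seconds_per_image: float      # measured on your own hardware
    max_seconds_per_image: float  # the longest wait you will accept
    license_allows_use: bool      # does the license cover your intended use?


def recommend(plan):
    """Return a rough recommendation from the three planning answers."""
    if not plan.license_allows_use:
        # No hardware choice fixes a license mismatch.
        return "choose a different model"
    if plan.vram_needed_gb > plan.vram_available_gb:
        return "rent a GPU"
    if plan.seconds_per_image > plan.max_seconds_per_image:
        return "rent a GPU"
    return "run locally"


if __name__ == "__main__":
    plan = GenerationPlan(
        vram_needed_gb=10, vram_available_gb=12,
        seconds_per_image=8, max_seconds_per_image=15,
        license_allows_use=True,
    )
    print(recommend(plan))
```

The license check comes first on purpose: it is the one answer that more hardware cannot change.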

In Spanvero's catalog, text-to-image models live in the media section alongside video and audio models, separate from text LLMs, with cost and hardware fit reported per image rather than per token. To find the recognized open image models with their honest VRAM-to-run and license details, see /best/best-open-image-generation-models/, browse the broader media catalog under /models/, and use the calculator at /calculator/ to compare the honest, $0-markup cost of generating locally versus on a rented GPU. The audio side of media generation is covered in the TTS / ASR explainer.

