Open image, video & voice models

62 open generative-media models you can actually download and run — text-to-image, video, text-to-speech, speech-to-text, music, and unified multimodal. For each: what it does, its download size, the VRAM to run it locally, the license, and how to run it. We don't host them — we point you to the real weights.

Unlike chat models, media models are priced per image / per second / per minute, not per token — so we show the honest "$0 on your own hardware, or rent a GPU by the hour" path.

Image · 16

Open text-to-image and image-editing models you can download and run.

Video · 14

Open text-to-video and image-to-video models: heavier, but runnable on a rented GPU.

Voice & Audio · 20

Open text-to-speech, voice cloning, speech-to-text, and music models.

Multimodal / Omni · 12

Unified models that handle several modalities (image, audio, video, and text) in one.


We point you to the open weights and your own accounts, with $0 markup, and we never resell compute. © 2026 Cynosure LLC.