Alibaba Qwen · Omni understanding + speech generation · 11B params · Apache-2.0 (commercial OK)
End-to-end Thinker-Talker model that accepts text, image, audio and video input and streams back both text and natural speech, which makes it the flagship open omni model that can actually talk. A 3B sibling also exists, but it ships under the non-commercial qwen-research license.
Note: generative-media models are billed per image, per second or per minute on hosted services, not per token. Running locally or on your own rented GPU is usually far cheaper and keeps your data on your machine.
| Spec | Details |
|---|---|
| Does | Visual understanding, Audio understanding, Video understanding, Text → speech, Speech → speech, Any → text |
| VRAM to run | ~24 GB (~31 GB in BF16; fits a single 24 GB card with flash-attention and a reduced context, or ~12 GB with the official AWQ 4-bit build; CPU offload is possible but slow for the streaming Talker) |
| Download | ~31 GB |
| Parameters | 11B |
| License | Apache-2.0 (commercial use OK) |
| Run with | Transformers (qwen-omni-utils) |
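
To sanity-check the VRAM row against the parameter count, here is a minimal weights-only estimate. It is a sketch under stated assumptions: the helper name `weight_footprint_gb` and the bytes-per-parameter table are illustrative and not part of any Qwen tooling, and the 11B figure is taken from the table above.

```python
# Rough weights-only memory estimate from parameter count and precision.
# Assumption: every parameter is stored at the same width, which ignores
# activations, the KV cache, and any layers a quantized build keeps unquantized.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Return weights-only size in decimal GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = billions of bytes = GB
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for dtype in ("bf16", "int4"):
        size = weight_footprint_gb(11, dtype)
        print(f"11B params at {dtype}: ~{size:.1f} GB of weights")
```

The result is a lower bound only. The figures quoted in the table (~31 GB in BF16, ~12 GB for the AWQ build) are higher, which is consistent with runtime headroom that this sketch does not model.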
Get Qwen2.5-Omni-7B on Hugging Face →
We point you to the open weights + your own accounts, $0 markup, never resell compute. © 2026 Cynosure LLC.