The Best 254 Text-to-Video Tools in 2025
Xclip Base Patch32
MIT
X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.
Text-to-Video
Transformers English
microsoft
309.80k
84
LTX Video
Other
The first DiT-based video generation model capable of real-time generation of high-quality videos, supporting two scenarios: text-to-video and image + text-to-video.
Text-to-Video English
Lightricks
165.42k
1,174
Wan2.1 14B VACE GGUF
Apache-2.0
The GGUF format version of the Wan2.1-VACE-14B model, mainly used for text-to-video generation tasks.
Text-to-Video
QuantStack
146.36k
139
Animatediff Lightning
Openrail
Ultra-fast text-to-video model, generating videos over ten times faster than the original AnimateDiff
Text-to-Video
ByteDance
144.00k
925
V Express
V-Express is a video generation model conditioned on audio and facial keypoints, capable of converting audio input into dynamic video output.
Text-to-Video English
tk93
118.36k
85
Cogvideox 5b
Other
CogVideoX is the open-source version of the video generation model derived from Qingying, providing high-quality video generation capabilities.
Text-to-Video English
THUDM
92.32k
611
Llava NeXT Video 7B Hf
LLaVA-NeXT-Video is an open-source multimodal chatbot that achieves strong video understanding through mixed training on video and image data, reaching state-of-the-art results among open-source models on the VideoMME benchmark.
Text-to-Video
Transformers English
llava-hf
65.95k
88
Wan2.1 T2V 14B Diffusers
Apache-2.0
Wan2.1 is a comprehensive open-source video foundation model designed to push the boundaries of video generation, supporting tasks such as text-to-video in Chinese and English, image-to-video, and more.
Text-to-Video Supports Multiple Languages
Wan-AI
48.65k
24
Wan2.1 T2V 1.3B Diffusers
Apache-2.0
Wan 2.1 is a comprehensive open-source video foundation model featuring top-tier performance, consumer-grade GPU support, multi-task capabilities, visual-text generation, and an efficient video VAE.
Text-to-Video Supports Multiple Languages
Wan-AI
45.29k
38
Wan2.1 T2V 14B
Apache-2.0
Wan 2.1 is a comprehensive open-source video foundation model capable of multiple tasks including text-to-video, image-to-video, video editing, text-to-image, and video-to-audio generation, with support for bilingual Chinese-English text input.
Text-to-Video Supports Multiple Languages
Wan-AI
44.88k
1,238
Wan2.1 T2V 14B Gguf
Apache-2.0
A text-to-video generation model converted to GGUF format, supporting usage via ComfyUI-GGUF custom nodes
Text-to-Video
city96
42.38k
130
Cogvideox 2b
Apache-2.0
CogVideoX is an open-source video generation model originating from Qingying. The 2B version is an entry-level model, balancing compatibility with low operational and development costs.
Text-to-Video English
THUDM
40.55k
324
Animatelcm
AnimateLCM is an efficient personalized stylized video generation model that does not require personalized video data, capable of generating high-quality videos with just 4 inference steps.
Text-to-Video
wangfuyun
33.16k
323
Wan Gguf
Apache-2.0
The GGUF quantized version of Wan Video is a text-to-video generation model suitable for older or low-end machines, supporting efficient inference via GGUF files.
Text-to-Video English
calcuis
26.46k
66
Ltxv 13b 0.9.7 Dev GGUF
Other
GGUF quantized version of the 13b-0.9.7-dev variant based on Lightricks/LTX-Video, supporting text-to-video and image-to-video generation tasks.
Text-to-Video English
wsbagnsv1
25.99k
61
Wan2.1 Fun 1.3B Control
Apache-2.0
Wan2.1-Fun-1.3B is a text-to-video generation model that supports multi-resolution training and first/last frame prediction.
Text-to-Video Supports Multiple Languages
alibaba-pai
22.19k
97
Wan2.1 T2V 1.3B
Apache-2.0
Wan 2.1 is a comprehensive open-source video foundation model designed to push the boundaries of video generation, supporting tasks such as text-to-video and image-to-video generation.
Text-to-Video Supports Multiple Languages
Wan-AI
19.89k
319
Clip4clip Webvid150k
A CLIP4Clip video-text retrieval model trained on a subset of the WebVid dataset for large-scale video-text retrieval applications
Text-to-Video
Transformers
Searchium-ai
19.30k
27
Text To Video Ms 1.7b
A multi-stage text-to-video diffusion model that generates videos matching English text descriptions
Text-to-Video
ali-vilab
14.01k
625
Wan2.1 Fun 14B InP Gguf
Apache-2.0
A 14B-parameter multimodal model released by Alibaba PAI, supporting text-to-video generation tasks
Text-to-Video Supports Multiple Languages
city96
13.97k
18
Zeroscope V2 576w
A watermark-free video generation model based on Modelscope, optimized for 16:9 aspect ratio and smooth video output
Text-to-Video
cerspense
12.59k
476
Cogvideox1.5 5B
Other
CogVideoX is an open-source video generation model similar to Qingying, supporting high-resolution video generation
Text-to-Video English
THUDM
11.12k
36
Wan2.1 Fun 14B Control
Apache-2.0
A text-to-video model supporting multi-resolution training and first/last frame prediction
Text-to-Video Supports Multiple Languages
alibaba-pai
10.53k
44
VACE Wan2.1 1.3B Preview
Apache-2.0
VACE is an all-round video creation and editing model that supports various tasks such as reference video generation, video-to-video editing, and masked video-to-video editing.
Text-to-Video Supports Multiple Languages
ali-vilab
10.05k
101
Wan2.1 VACE 14B
Apache-2.0
Wan2.1 is a comprehensive and open video foundation model designed to push the boundaries of video generation, supporting various video generation and editing tasks.
Text-to-Video Supports Multiple Languages
Wan-AI
8,797
176
Llava NeXT Video 7B DPO
LLaVA-NeXT-Video is an open-source multimodal dialogue model, built by fine-tuning a large language model on multimodal instruction-following data, supporting combined video and text interactions.
Text-to-Video
Transformers
lmms-lab
8,049
27
Ltxv Gguf
Other
A GGUF quantized version based on the Lightricks/LTX-Video model, supporting text-to-video, image-to-video, and video-to-video tasks.
Text-to-Video English
calcuis
7,378
48
Wan2.1 Fun 14B InP
Apache-2.0
A text-to-video model developed by the Alibaba Cloud PAI team, supporting multi-resolution training and first/last frame prediction
Text-to-Video Supports Multiple Languages
alibaba-pai
7,011
40
Wan2.1 Fun 1.3B InP
Apache-2.0
Wan2.1-Fun-1.3B is a text-to-video generation model developed by the Alibaba PAI team, supporting multi-resolution training and first/last frame prediction.
Text-to-Video Supports Multiple Languages
alibaba-pai
6,753
25
Cosmos Reason1 7B GGUF
Other
Cosmos-Reason1 is a physical AI model developed by NVIDIA, capable of understanding physical common sense and generating embodied decisions in natural language through long chain-of-thought reasoning.
Text-to-Video
Transformers English
unsloth
6,690
1
Wan2.1 T2V 14B
Apache-2.0
Wan2.1 is an open and advanced large-scale video generation model that supports various tasks including text-to-video and image-to-video, compatible with consumer-grade GPUs.
Text-to-Video Supports Multiple Languages
Isi99999
6,470
0
Ltxv 13b 0.9.7 Distilled GGUF
Other
LTX-Video is a text-to-video generation model that supports creating video content from text or images.
Text-to-Video English
wsbagnsv1
6,208
19
Hunyuanvideo Gguf
Other
GGUF quantized version of Tencent's HunyuanVideo model, designed specifically for ComfyUI for text-to-video generation tasks
Text-to-Video
city96
6,142
162
Animatediff Motion Lora Tilt Up
A Motion LoRA model that adds a specific type of motion effect to animations
Text-to-Video
guoyww
5,936
1
Moviigen1.1
Apache-2.0
MoviiGen 1.1 is a cinematic video generation model fine-tuned based on Wan2.1, excelling in film aesthetics and visual quality.
Text-to-Video English
ZuluVision
5,165
47
Wan2.1 Fun 14B Control Gguf
Apache-2.0
A 14B-parameter multimodal model released by Alibaba PAI, supporting text-to-video generation tasks
Text-to-Video Supports Multiple Languages
city96
5,120
10
Xclip Base Patch16 Zero Shot
MIT
X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained contrastively on (video, text) pairs, suitable for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
Text-to-Video
Transformers English
microsoft
5,045
24
Cosmos 1.0 Diffusion 7B Text2World
Other
A multimodal world foundation model based on diffusion architecture developed by NVIDIA, capable of generating high-quality physics-aware videos from text inputs
Text-to-Video
nvidia
5,011
220
LTX Video Diffusers
Diffusers-based implementation of the LTX-Video model, supporting high-quality video generation from text or images
Text-to-Video
a-r-r-o-w
4,519
3
I2vgen Xl
MIT
An open-source video synthesis codebase developed by Alibaba's Tongyi Lab, integrating multiple advanced video generation models
Text-to-Video
ali-vilab
4,252
172
LTX Video 0.9.1 Diffusers
LTX-Video model in Diffusers format, supporting text-to-video and image-to-video generation
Text-to-Video
a-r-r-o-w
3,951
7
Skyreels V2 T2V 14B 720P
Other
SkyReels V2 is an infinite-length film generation model that uses an autoregressive Diffusion Forcing architecture, supporting high-resolution video generation.
Text-to-Video
Skywork
3,942
25