Text-to-Video

The Best 254 Text-to-Video Tools in 2025

Xclip Base Patch32

X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.

Transformers English
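
The X-CLIP entry above mentions contrastive training on (video, text) pairs and video-text retrieval. As a rough illustration of the retrieval step only, the sketch below ranks videos by cosine similarity between a text embedding and video embeddings. The embeddings are made-up toy vectors and the helper names (`cosine`, `rank_videos`) are hypothetical; a real system would obtain embeddings from a trained model, and this is not the X-CLIP implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_emb, video_embs):
    """Return video names sorted from most to least similar to the text."""
    scores = {name: cosine(text_emb, emb) for name, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-dimensional embeddings, invented for illustration only.
video_embs = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "soccer_clip": [0.1, 0.9, 0.2],
    "piano_clip": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of a query such as "a person kicking a ball".
query_emb = [0.2, 0.8, 0.1]

ranking = rank_videos(query_emb, video_embs)
print(ranking)
```

Contrastive training aims to place matching video and text embeddings close together, which is why a plain similarity ranking like this can serve for retrieval once the embeddings exist.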

The first DiT-based video generation model capable of real-time generation of high-quality videos, supporting two scenarios: text-to-video and image + text-to-video.

Text-to-Video English

Wan2.1 14B VACE GGUF

The GGUF format version of the Wan2.1-VACE-14B model, mainly used for text-to-video generation tasks.

Animatediff Lightning

An ultra-fast text-to-video model that generates videos over ten times faster than the original AnimateDiff.

V-Express is an audio and facial keypoint condition-based video generation model capable of converting audio input into dynamic video output.

Text-to-Video English

CogVideoX is the open-source version of the video generation model derived from Qingying, providing high-quality video generation capabilities.

Text-to-Video English

Llava NeXT Video 7B Hf

LLaVA-NeXT-Video is an open-source multimodal chatbot that achieves excellent video understanding capabilities through mixed training on video and image data, reaching SOTA level among open-source models on the VideoMME benchmark.

Transformers English

Wan2.1 T2V 14B Diffusers

Wan2.1 is a comprehensive open-source video foundation model designed to push the boundaries of video generation, supporting tasks such as text-to-video in Chinese and English, image-to-video, and more.

Text-to-Video Supports Multiple Languages

Wan2.1 T2V 1.3B Diffusers

Wan2.1 is a comprehensive open-source video foundation model featuring top-tier performance, consumer-grade GPU support, multi-task capabilities, visual-text generation, and an efficient video VAE.

Text-to-Video Supports Multiple Languages

Wan 2.1 is a comprehensive open-source video foundation model capable of multiple tasks including text-to-video, image-to-video, video editing, text-to-image, and video-to-audio generation, with support for bilingual Chinese-English text input.

Text-to-Video Supports Multiple Languages

Wan2.1 T2V 14B Gguf

A text-to-video generation model converted to GGUF format, supporting usage via ComfyUI-GGUF custom nodes

CogVideoX is an open-source video generation model originating from Qingying. The 2B version is an entry-level model, balancing compatibility with low operational and development costs.

Text-to-Video English

AnimateLCM is an efficient personalized stylized video generation model that does not require personalized video data, capable of generating high-quality videos with just 4 inference steps.

A GGUF-quantized version of the Wan video model for text-to-video generation, suited to older or low-end machines and supporting efficient inference via GGUF files.

Text-to-Video English

Ltxv 13b 0.9.7 Dev GGUF

GGUF quantized version of the 13b-0.9.7-dev variant based on Lightricks/LTX-Video, supporting text-to-video and image-to-video generation tasks.

Text-to-Video English

Wan2.1 Fun 1.3B Control

Wan2.1-Fun-1.3B is a text-to-video generation model that supports multi-resolution training and first/last frame prediction.

Text-to-Video Supports Multiple Languages

Wan2.1 T2V 1.3B

Wan 2.1 is a comprehensive open-source video foundation model designed to push the boundaries of video generation, supporting tasks such as text-to-video and image-to-video generation.

Text-to-Video Supports Multiple Languages

Clip4clip Webvid150k

A CLIP4Clip video-text retrieval model trained on a subset of the WebVid dataset for large-scale video-text retrieval applications

Text To Video Ms 1.7b

A multi-stage text-to-video diffusion model that generates videos matching English text descriptions.

Wan2.1 Fun 14B InP Gguf

A 14B-parameter multimodal model released by Alibaba PAI, supporting text-to-video generation tasks

Text-to-Video Supports Multiple Languages

Zeroscope V2 576w

A watermark-free video generation model based on ModelScope, optimized for a 16:9 aspect ratio and smooth video output.

Cogvideox1.5 5B

CogVideoX is an open-source video generation model originating from Qingying, supporting high-resolution video generation.

Text-to-Video English

Wan2.1 Fun 14B Control

A text-to-video model supporting multi-resolution training and first/last frame prediction

Text-to-Video Supports Multiple Languages

VACE Wan2.1 1.3B Preview

VACE is an all-round video creation and editing model that supports various tasks such as reference video generation, video-to-video editing, and masked video-to-video editing.

Text-to-Video Supports Multiple Languages

Wan2.1 VACE 14B

Wan2.1 is a comprehensive and open video foundation model designed to push the boundaries of video generation, supporting various video generation and editing tasks.

Text-to-Video Supports Multiple Languages

Llava NeXT Video 7B DPO

LLaVA-NeXT-Video is an open-source multimodal dialogue model, built by fine-tuning a large language model on multimodal instruction-following data, that supports combined video and text interaction.

A GGUF quantized version based on the Lightricks/LTX-Video model, supporting text-to-video, image-to-video, and video-to-video tasks.

Text-to-Video English

Wan2.1 Fun 14B InP

A text-to-video model developed by the Alibaba Cloud PAI team, supporting multi-resolution training and first/last frame prediction.

Text-to-Video Supports Multiple Languages

Wan2.1 Fun 1.3B InP

Wan2.1-Fun-1.3B is a text-to-video generation model developed by the Alibaba PAI team, supporting multi-resolution training and first/last frame prediction.

Text-to-Video Supports Multiple Languages

Cosmos Reason1 7B GGUF

Cosmos-Reason1 is a physical AI model developed by NVIDIA, capable of understanding physical common sense and expressing embodied decisions in natural language through long chain-of-thought reasoning.

Transformers English

Wan2.1 is an open and advanced large-scale video generation model that supports various tasks including text-to-video and image-to-video, compatible with consumer-grade GPUs.

Text-to-Video Supports Multiple Languages

Ltxv 13b 0.9.7 Distilled GGUF

LTX-Video is a text-to-video generation model that supports creating video content from text or images.

Text-to-Video English

Hunyuanvideo Gguf

A GGUF-quantized version of Tencent's HunyuanVideo model, designed specifically for ComfyUI, for text-to-video generation tasks.

Animatediff Motion Lora Tilt Up

A motion LoRA model that adds a specific type of motion effect to animations.

MoviiGen 1.1 is a cinematic video generation model fine-tuned based on Wan2.1, excelling in film aesthetics and visual quality.

Text-to-Video English

Wan2.1 Fun 14B Control Gguf

A 14B-parameter multimodal model released by Alibaba PAI, supporting text-to-video generation tasks

Text-to-Video Supports Multiple Languages

Xclip Base Patch16 Zero Shot

X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained contrastively on (video, text) pairs, suitable for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.

Transformers English

Cosmos 1.0 Diffusion 7B Text2World

A multimodal world foundation model based on a diffusion architecture, developed by NVIDIA, capable of generating high-quality physics-aware videos from text inputs.

LTX Video Diffusers

Diffusers-based implementation of the LTX-Video model, supporting high-quality video generation from text or images

An open-source video synthesis codebase developed by Alibaba's Tongyi Lab, integrating multiple advanced video generation models

LTX Video 0.9.1 Diffusers

LTX-Video model in Diffusers format, supporting text-to-video and image-to-video generation

Skyreels V2 T2V 14B 720P

SkyReels V2 is an unlimited-length cinematic video generation model that employs an autoregressive Diffusion Forcing architecture and supports high-resolution video generation.

AIbase

© 2025 AIbase