Text2Motion Open-Source Video Generation Model Suite - Supports Video Generation Tasks from Text and Images

Text2Motion

Developed by Quantamhash

An open and advanced large-scale video generation model suite supporting multiple tasks including text-to-video and image-to-video generation

Text-to-Video | English | Open Source License: Apache-2.0 | #Bilingual video generation #High dynamic range scenes #Consumer-grade GPU compatibility

Downloads 233

Release Time: 3/21/2025

Model Overview

Text2Motion is a comprehensive open-source video foundation model suite that pushes the boundaries of video generation, supporting bilingual text input (Chinese/English) and two output resolutions (480P/720P).

Model Features

State-of-the-art performance

Outperforms existing open-source models and commercial solutions across multiple benchmarks

Consumer GPU support

T2V-1.3B model requires only 8.19GB VRAM, generating 5-second 480P video in ~4 minutes on RTX 4090

Multi-task capability

Supports various tasks including text-to-video, image-to-video, and video editing

Bilingual text generation

First video model capable of generating both Chinese and English text within videos

Efficient video VAE

Maintains temporal information when encoding/decoding arbitrary-length 1080P videos with optimal efficiency and performance

Model Capabilities

Text-to-video

Image-to-video

Video editing

Text-to-image

Video-to-audio

Use Cases

Entertainment content creation

Animated short generation

Generate anthropomorphic animal animations from text descriptions

Example: Generate 480P/720P video of two anthropomorphic cats boxing

Advertisement production

Product showcase videos

Automatically generate product demonstration videos from descriptions

🚀 Text2Motion

Text2Motion: Open and advanced large-scale video generative models that redefine the boundaries of video generation.

In this repository, we present Text2Motion, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. Text2Motion offers these key features:

👍 SOTA Performance: Text2Motion consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
👍 Supports Consumer-grade GPUs: The T2V-1.3B model requires only 8.19 GB VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques like quantization). Its performance is even comparable to some closed-source models.
👍 Multiple Tasks: Text2Motion excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.
👍 Visual Text Generation: Text2Motion is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
👍 Powerful Video VAE: Text2Motion-VAE delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

This repository features our T2V-14B model, which establishes a new SOTA performance benchmark among both open-source and closed-source models. It demonstrates exceptional capabilities in generating high-quality visuals with significant motion dynamics. It is also the only video model capable of producing both Chinese and English text, and it supports video generation at both 480P and 720P resolutions.

✨ Features

SOTA Performance: Outperforms other models in multiple benchmarks.
Consumer-grade GPU Support: Compatible with most consumer-grade GPUs.
Multiple Tasks: Capable of handling various video-related tasks.
Visual Text Generation: Can generate both Chinese and English text.
Powerful Video VAE: Efficiently encodes and decodes 1080P videos.

🚀 Quick Start

📦 Installation

Clone the repo:

git clone https://huggingface.co/sbapan41/Text2Motion
cd Text2Motion

Install dependencies:

# Ensure torch >= 2.4.0
pip install -r requirements.txt
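
The `torch >= 2.4.0` requirement noted above can be checked before running inference. The following is a minimal sketch, not part of the repository: the helper name `meets_minimum` is hypothetical, and it compares only the leading numeric components of the version string, ignoring local build suffixes such as `+cu121`.

```python
import re

MIN_TORCH = (2, 4, 0)  # minimum version stated in the install notes


def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """Return True if the numeric release part of `version` is >= `minimum`.

    Hypothetical helper for illustration; suffixes like "+cu121" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. "2.4") is treated as 0.
    parts = tuple(int(p) if p is not None else 0 for p in match.groups())
    return parts >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old"
        print(f"torch {torch.__version__}: {status}")
```

Comparing integer tuples rather than raw strings avoids the pitfall where `"2.10.0"` sorts before `"2.4.0"` lexicographically.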

📥 Model Download

Property	Details
Model Type	T2V-14B
Download Link	🤗 Huggingface
Notes	Supports both 480P and 720P

Download models using 🤗 huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download sbapan41/Text2Motion --local-dir ./Text2Motion
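
The same download can also be scripted from Python with `snapshot_download` from the `huggingface_hub` package installed above. This is a sketch under the assumption that the whole repository snapshot is wanted; the wrapper name `download_weights` is hypothetical, while the repository id and target directory are taken from the CLI command.

```python
REPO_ID = "sbapan41/Text2Motion"  # repository id from the CLI command above
LOCAL_DIR = "./Text2Motion"       # same target directory as --local-dir


def download_weights(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download the full repository snapshot and return its local path.

    Hypothetical wrapper for illustration; requires network access.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_weights()` should leave the checkpoint files in `./Text2Motion`, which is the directory later passed to `--ckpt_dir`.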

💻 Usage Examples

Basic Usage

This repository supports the 14B Text-to-Video model at two resolutions (480P and 720P). The supported task and configurations are as follows:

Task	480P	720P	Model
t2v-14B	✔️	✔️	Text2Motion-14B

(1) Without Prompt Extension

To facilitate implementation, we will start with a basic version of the inference process that skips the prompt extension step.

Single-GPU inference

python generate.py --task 14B --size 1280*720 --ckpt_dir ./Text2Motion --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

If you encounter OOM (out-of-memory) issues, for example on an RTX 4090 GPU, add the --offload_model True and --t5_cpu options to the command above to reduce GPU memory usage.
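
One way to keep the standard and memory-saving invocations in sync is to assemble the command in a small script. The sketch below only builds and prints the command line; the function `build_generate_command` and its `low_memory` parameter are hypothetical, and every flag it emits is copied from the commands and options mentioned in this section.

```python
import shlex

PROMPT = (
    "Two anthropomorphic cats in comfy boxing gear and bright gloves "
    "fight intensely on a spotlighted stage."
)


def build_generate_command(prompt, ckpt_dir="./Text2Motion", size="1280*720",
                           task="14B", low_memory=False):
    """Return the generate.py invocation as an argument list.

    Hypothetical helper for illustration. With low_memory=True it appends the
    two options this section suggests for out-of-memory situations.
    """
    cmd = ["python", "generate.py",
           "--task", task, "--size", size, "--ckpt_dir", ckpt_dir]
    if low_memory:
        cmd += ["--offload_model", "True", "--t5_cpu"]
    cmd += ["--prompt", prompt]
    return cmd


# Print a shell-safe version of the low-memory command.
low_memory_cmd = build_generate_command(PROMPT, low_memory=True)
print(shlex.join(low_memory_cmd))
```

The resulting list can be passed directly to `subprocess.run(low_memory_cmd, check=True)` from the repository root, which avoids shell quoting issues with the `*` in the size argument.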

Multi-GPU inference using FSDP + xDiT USP

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task 14B --size 1280*720 --ckpt_dir ./Text2Motion --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
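
In the command above, `--nproc_per_node` and `--ulysses_size` are both set to 8. The following sketch derives both from a single GPU count so they cannot drift apart. It assumes, based only on that example, that the two values should match when no other parallelism option is used; whether GPU counts other than 8 work has not been verified here. The function name `build_torchrun_command` is hypothetical.

```python
import shlex


def build_torchrun_command(prompt, num_gpus=8, ckpt_dir="./Text2Motion",
                           size="1280*720", task="14B"):
    """Return the multi-GPU torchrun invocation as an argument list.

    Hypothetical helper for illustration. Assumption: --ulysses_size equals
    the number of processes, as in the 8-GPU example in this section.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun", f"--nproc_per_node={num_gpus}", "generate.py",
        "--task", task, "--size", size, "--ckpt_dir", ckpt_dir,
        "--dit_fsdp", "--t5_fsdp",
        "--ulysses_size", str(num_gpus),
        "--prompt", prompt,
    ]


multi_gpu_cmd = build_torchrun_command("Two anthropomorphic cats boxing.")
print(shlex.join(multi_gpu_cmd))
```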

🔥 Latest News!!

Feb 22, 2025: 👋 We've released the inference code and weights of Text2Motion.

📋 Todo List

Text2Motion Text-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints of the 14B model
- [x] Gradio demo
- [ ] Diffusers integration
- [ ] ComfyUI integration
Text2Motion Image-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints of the 14B model
- [x] Gradio demo
- [ ] Diffusers integration
- [ ] ComfyUI integration

🔧 Technical Details

Model	Dimension	Input Dimension	Output Dimension	Feedforward Dimension	Frequency Dimension	Number of Heads	Number of Layers
14B	5120	16	16	13824	256	40	40
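
As a rough sanity check, the table values can be related to the "14B" name with some back-of-the-envelope arithmetic. The block layout used below is an assumption, not something the table states: each layer is taken to contain one self-attention and one cross-attention module (four dimension-by-dimension projections each) plus a two-matrix feedforward network, and biases, normalization, and embedding weights are ignored.

```python
# Values copied from the technical details table.
DIM = 5120        # model dimension
FFN_DIM = 13824   # feedforward dimension
NUM_HEADS = 40
NUM_LAYERS = 40

# Per-head width follows directly from the table.
head_dim = DIM // NUM_HEADS

# ASSUMED layout (not stated in the table): self-attention + cross-attention,
# each with Q, K, V, and output projections, plus a two-matrix feedforward.
attn_params = 4 * DIM * DIM
ffn_params = 2 * DIM * FFN_DIM
per_layer = 2 * attn_params + ffn_params
total = NUM_LAYERS * per_layer

print(f"head dimension: {head_dim}")
print(f"approx. transformer weights: {total / 1e9:.2f}B")
```

Under these assumptions the script prints a head dimension of 128 and roughly 14.05B weights, which is consistent with the model's name but should not be read as an exact parameter count.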

📄 License

This project is licensed under the Apache-2.0 license.
