Text-to-video-ms-1.7b Open-source Model - Generate Corresponding Videos for Free by Inputting English Texts

Text To Video Ms 1.7b

Developed by vdo

Built on a multi-stage text-to-video diffusion architecture, the model takes an English text description as input and generates a video that matches it.

Text-to-Video #English text-to-video generation #Multi-stage diffusion model #Open-domain content generation


Release Time : 5/7/2023

Model Overview

The text-to-video diffusion model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features to a video latent space, and a model that maps the video latent space to the video visual space. In total it has approximately 1.7 billion parameters and generates video from text descriptions.

Model Features

Multi-stage generation architecture

Generation is split across three sub-networks: text feature extraction, diffusion in the video latent space, and conversion from the latent space to the visual space.

Long video generation ability

With memory optimizations such as model CPU offloading and VAE slicing, videos up to 25 seconds long can be generated on a GPU with 16 GB of video memory.

Open-domain generation

Generates videos from arbitrary English text descriptions, which suits a wide range of application scenarios.

Model Capabilities

Text-to-video generation

Open-domain content creation

Dynamic scene synthesis

Use Cases

Creative content generation

Concept visualization

Convert abstract text descriptions into visual video content.

Generate dynamic scenes that match the text description.

Educational demonstration

Generate visual demonstration videos of teaching concepts.

Help understand complex concepts.

Entertainment content creation

Short video generation

Generate short video content based on creative text.

The examples show creative scenes such as an astronaut riding a horse and Darth Vader surfing.

🚀 Text-to-video-synthesis Model in Open Domain

This is a diffusion-based text-to-video generation model that takes English text descriptions as input and generates corresponding videos.

🚀 Quick Start

The model can generate videos from arbitrary English text descriptions.

Let's first install the required libraries:

$ pip install diffusers transformers accelerate

Now, generate a video:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

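prompt = "Spiderman is surfing"
# Run 25 denoising steps. On newer diffusers releases `.frames` is batched,
# so `.frames[0]` may be needed here to get the frames of the first video.
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and return its path
video_path = export_to_video(video_frames)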

✨ Features

Multi-stage Generation: The model consists of three sub-networks for text feature extraction, text feature-to-video latent space diffusion, and video latent space to video visual space transformation.
English-only Support: Currently, it supports only English input for video generation.
Wide Application: It can reason about and generate videos from arbitrary English text descriptions.
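To make the "video latent space" stage more concrete, here is a minimal sketch of how large the latent tensor might be. The default values (256x256 output, 16 frames, 4 latent channels, a VAE downscale factor of 8) are assumptions based on common diffusers defaults, not figures stated in this model card, and the helper names are hypothetical.

```python
from math import prod


def latent_shape(num_frames=16, height=256, width=256,
                 latent_channels=4, vae_scale_factor=8, batch_size=1):
    """Shape of the video latent as (batch, channels, frames, height, width).

    All default values are assumptions for illustration only.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be multiples of vae_scale_factor")
    return (batch_size, latent_channels, num_frames,
            height // vae_scale_factor, width // vae_scale_factor)


def latent_bytes(shape, bytes_per_element=2):
    """Memory for one latent tensor; 2 bytes per element corresponds to float16."""
    return prod(shape) * bytes_per_element


short_clip = latent_shape()               # 16 frames
long_clip = latent_shape(num_frames=200)  # 200 frames
print(short_clip, latent_bytes(short_clip))
print(long_clip, latent_bytes(long_clip))
```

In this sketch only the frame axis changes with clip length, so the latent grows linearly with the number of frames.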

📦 Installation

To use this model, you need to install the following libraries:

$ pip install diffusers transformers accelerate

💻 Usage Examples

Basic Usage

Generate a video based on a simple text description:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

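prompt = "Spiderman is surfing"
# Run 25 denoising steps. On newer diffusers releases `.frames` is batched,
# so `.frames[0]` may be needed here to get the frames of the first video.
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and return its path
video_path = export_to_video(video_frames)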

Advanced Usage

Optimize memory usage to generate longer videos:

$ pip install git+https://github.com/huggingface/diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# optimize for GPU memory
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# generate 200 frames (on newer diffusers releases, `.frames[0]` may be needed)
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames

# convert to video
video_path = export_to_video(video_frames)
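The link between num_frames and clip length is simple arithmetic. The sketch below assumes a playback rate of 8 frames per second, which is an assumption chosen because it makes 200 frames correspond to the 25 seconds mentioned earlier; the frame rate actually written by export_to_video depends on the installed diffusers version, so check it before relying on these numbers. The helper names are hypothetical.

```python
import math


def clip_duration_seconds(num_frames, fps=8):
    """Duration of a clip with num_frames frames played at fps (assumed 8)."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def frames_for_duration(seconds, fps=8):
    """Smallest frame count that covers the requested duration at fps."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return math.ceil(seconds * fps)


print(clip_duration_seconds(200))  # 200 frames at the assumed 8 fps
print(frames_for_duration(25))     # frames needed for 25 seconds
```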

📚 Documentation

Model Details

Property	Details
Developed by	ModelScope
Model Type	Diffusion-based text-to-video generation model
Language(s)	English
License	CC-BY-NC-ND
Resources for more information	ModelScope GitHub Repository, Summary

Use cases

This model can be used to generate videos based on arbitrary English text descriptions, with a wide range of applications.

Model limitations and biases

The model is trained on public datasets such as WebVid, so generated results may reflect biases in the distribution of the training data.
The model cannot produce film- or television-quality output.
The model cannot generate legible text.
The model is trained mainly on an English corpus and does not currently support other languages.
The model's performance on complex compositional generation tasks needs improvement.
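Because only English prompts are supported, it can help to flag prompts that are probably not English before spending GPU time on them. The check below is a deliberately crude heuristic (empty prompt, or any non-ASCII character) and the function name is hypothetical; it is not part of the model or of diffusers, and it will misjudge English prompts that contain accented or typographic characters.

```python
def prompt_warnings(prompt):
    """Return a list of human-readable warnings for a prompt (empty if none)."""
    warnings = []
    if not prompt.strip():
        warnings.append("prompt is empty")
    # Crude stand-in for language detection: report non-ASCII characters.
    non_ascii = sorted({ch for ch in prompt if ord(ch) > 127})
    if non_ascii:
        warnings.append("prompt contains non-ASCII characters: " + "".join(non_ascii))
    return warnings


for candidate in ["Spiderman is surfing", "蜘蛛侠在冲浪", "   "]:
    print(repr(candidate), prompt_warnings(candidate))
```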

Misuse, Malicious Use and Excessive Use

The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
Generating pornographic, violent, or gory content is prohibited.
Generating erroneous or false information is prohibited.

Training data

The training data includes LAION5B, ImageNet, WebVid, and other public datasets. After pre-training, images and videos are filtered using criteria such as aesthetic score, watermark score, and deduplication.
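The filtering step can be pictured as a single pass over scored samples. The sketch below is illustrative only: the record fields, the score conventions, and both thresholds are hypothetical, since the model card does not state how the scores are computed or which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    uri: str
    content_hash: str        # hypothetical fingerprint used for deduplication
    aesthetic_score: float   # assumed convention: higher is better
    watermark_score: float   # assumed convention: higher means more likely watermarked


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds, dropping duplicates."""
    seen_hashes = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue  # not aesthetic enough
        if sample.watermark_score > max_watermark:
            continue  # probably watermarked
        if sample.content_hash in seen_hashes:
            continue  # duplicate of a sample already kept
        seen_hashes.add(sample.content_hash)
        kept.append(sample)
    return kept


dataset = [
    Sample("a", "h1", 6.2, 0.1),
    Sample("b", "h1", 6.0, 0.1),  # same hash as "a"
    Sample("c", "h2", 3.0, 0.1),  # low aesthetic score
    Sample("d", "h3", 7.0, 0.9),  # high watermark score
]
print([s.uri for s in filter_samples(dataset)])
```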

Citation

    @InProceedings{VideoFusion,
        author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
        title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2023}
    }

🔧 Technical Details

The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text feature-to-video latent space diffusion model, and a video latent space to video visual space model. The overall model has about 1.7 billion parameters. The diffusion model adopts a UNet3D structure and generates video by iteratively denoising a video of pure Gaussian noise.
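The iterative denoising process can be illustrated with a toy loop. This is a sketch under strong assumptions, not the model's actual sampler: it uses a DDIM-style deterministic update with a linear beta schedule, a flat list of floats in place of a video latent, and an "oracle" noise predictor that already knows the clean target and stands in for the UNet3D. All function names are hypothetical.

```python
import math
import random


def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise_predictor(x_t, alpha_bar_t, target):
    """Stand-in for the UNet3D: returns the exact noise given the known target."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * x0) / sigma for x, x0 in zip(x_t, target)]


def denoise(target, num_inference_steps=25, seed=0):
    """Start from pure Gaussian noise and denoise step by step (DDIM-style)."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars()
    last = len(alpha_bars) - 1
    # Evenly spaced timesteps, ordered from most noisy to least noisy.
    timesteps = [round(i * last / (num_inference_steps - 1))
                 for i in range(num_inference_steps)][::-1]
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for idx, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the final step there is no noise left, so alpha_bar is 1.
        ab_prev = alpha_bars[timesteps[idx + 1]] if idx + 1 < len(timesteps) else 1.0
        eps = oracle_noise_predictor(x, ab_t, target)
        # Estimate the clean sample, then re-noise it to the previous level.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
             for x0, e in zip(x0_pred, eps)]
    return x


clean = [0.5, -1.0, 0.25, 2.0]
print(denoise(clean))
```

With a perfect noise predictor the loop recovers the target up to floating-point error; a learned network only approximates that noise, which is why the number of steps and the choice of scheduler affect the quality of real outputs.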

📄 License

This model is licensed under CC-BY-NC-ND.

⚠️ Important Note

This model is intended for research purposes. Please read the "Model limitations and biases" and "Misuse, Malicious Use and Excessive Use" sections.

We Are Hiring! (Based in Beijing / Hangzhou, China.)

If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.

EMAIL: yingya.zyy@alibaba-inc.com
