Text-to-Video-MS-1.7B Open-Source Model - Input English text to freely generate videos that match the description


Developed by ali-vilab

Based on a multi-stage text-to-video diffusion model, it generates videos matching English text descriptions

Text-to-Video #Multi-stage diffusion model #English text-to-video generation #UNet3D architecture

Downloads 14.01k

Release Time: 3/22/2023

Model Overview

The text-to-video diffusion model consists of three subnetworks: a text feature extraction model, a diffusion model that maps text features to the video latent space, and a model that maps the video latent space to the visual space. The model has approximately 1.7 billion parameters in total and currently supports only English input.

Model Features

Multi-stage generation architecture

Composed of three subnetworks: text feature extraction, text feature-to-video latent space diffusion, and video latent space-to-visual space

Long video generation capability

With memory optimizations enabled, the model can generate videos up to 25 seconds long on less than 16GB of GPU memory

Memory optimization technology

Supports attention slicing and VAE slicing, which combine with Torch 2.0 to reduce GPU memory usage

Model Capabilities

Text-to-video generation

Open-domain video creation

Multi-object scene synthesis

Use Cases

Creative content generation

Fictional scene creation

Generate videos of fictional characters in unreal scenarios, such as an astronaut riding a horse

Can produce smooth animations of fictional scenes

Concept visualization

Transform abstract concepts or text descriptions into visual videos

Quickly achieve visual expression of creative concepts

Education and entertainment

Educational content production

Create supporting video materials for educational content

Simplify the educational video production process

🚀 Text-to-video-synthesis Model in Open Domain

This model is a multi-stage text-to-video generation diffusion model. It takes a text description as input and outputs a video that matches the description. Currently, it only supports English input.

We are actively seeking talented individuals to join our team, based in Beijing or Hangzhou, China. If you're eager for a challenging role and the chance to work with cutting-edge AIGC and large-scale pretraining technologies, we'd love to hear from you. Send your CV to yingya.zyy@alibaba-inc.com if you're interested.

🚀 Quick Start

Let's first install the necessary libraries:

$ pip install diffusers transformers accelerate torch

Now, generate a video:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)

✨ Features

This model can reason and generate videos based on arbitrary English text descriptions, with a wide range of applications.
Memory usage can be reduced by enabling attention slicing and VAE slicing and by using Torch 2.0, which allows generating videos up to 25 seconds long on less than 16GB of GPU VRAM.
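The 25-second figure can be related to a frame count with simple arithmetic. The sketch below assumes a playback rate of 8 frames per second, which is an assumption rather than something stated here; the rate actually used depends on how the frames are exported. The helper names are hypothetical.

```python
import math


def clip_duration_seconds(num_frames: int, fps: float = 8.0) -> float:
    """Return the playback length of a clip with `num_frames` frames.

    The default of 8 fps is an assumption for illustration only.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float = 8.0) -> int:
    """Return the smallest frame count that covers `seconds` of playback."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return math.ceil(seconds * fps)


if __name__ == "__main__":
    print(clip_duration_seconds(200))  # length of a 200-frame clip at the assumed rate
    print(frames_for_duration(25))     # frames needed for a 25-second clip
```

Under that assumption, a 25-second clip corresponds to 200 frames, so memory use grows with the requested clip length.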

📦 Installation

$ pip install diffusers transformers accelerate torch

For long-video generation:

$ pip install git+https://github.com/huggingface/diffusers transformers accelerate

💻 Usage Examples

Basic Usage

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)

Advanced Usage

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# optimize for GPU memory
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# generate
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames

# convert to video
video_path = export_to_video(video_frames)
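The memory switches used in the example can be grouped into one helper. This is a minimal sketch: `apply_memory_optimizations` is a hypothetical name, and the optional `enable_attention_slicing()` call is an addition based on the attention slicing mentioned elsewhere on this page, not a step taken from the example itself. With Torch 2.0 attention slicing may bring little benefit, so it is left as a flag.

```python
def apply_memory_optimizations(pipe, attention_slicing: bool = True):
    """Enable memory-saving options on an already loaded diffusers pipeline.

    `pipe` is expected to be the pipeline object from the example above.
    """
    # Keep sub-models on the CPU and move each to the GPU only while it runs.
    pipe.enable_model_cpu_offload()
    # Decode the latents in slices instead of all at once.
    pipe.enable_vae_slicing()
    # Assumption: also compute attention in slices; skipped if the pipeline
    # does not expose this method or the caller turns the flag off.
    if attention_slicing and hasattr(pipe, "enable_attention_slicing"):
        pipe.enable_attention_slicing()
    return pipe
```

Calling `apply_memory_optimizations(pipe)` right after loading the pipeline would replace the two separate calls in the example; passing `attention_slicing=False` reproduces the example exactly.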

📚 Documentation

Model description

The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text-feature-to-video-latent-space diffusion model, and a video-latent-space-to-video-visual-space model. The overall model has about 1.7 billion parameters. Currently, it only supports English input. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process that starts from a pure Gaussian noise video.
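To make the idea of iterative denoising concrete, here is a toy sketch in plain Python. It is not the model's UNet3D, and it does not use the DPM-Solver scheduler from the usage examples: it runs a deterministic DDIM-style update on a tiny flattened "video", and the noise predictor is a hypothetical oracle that already knows the target clip. The noise schedule values and all names are assumptions chosen for illustration.

```python
import math
import random


def ddim_sample(x_start, predict_noise, alpha_bars):
    """Deterministic DDIM-style sampling loop (no added noise per step).

    alpha_bars[t] is the fraction of signal left at step t;
    alpha_bars[0] must be 1.0 (clean) and values shrink as t grows.
    """
    x = list(x_start)
    for t in range(len(alpha_bars) - 1, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = predict_noise(x, t)
        # Estimate the clean sample implied by the predicted noise.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                  for xi, ei in zip(x, eps)]
        # Move to the previous, less noisy level.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * ei
             for x0, ei in zip(x0_hat, eps)]
    return x


# A toy "video": 4 frames of 2x2 pixels, flattened into one list.
FRAMES, HEIGHT, WIDTH = 4, 2, 2
target = [(i % 5) / 4.0 for i in range(FRAMES * HEIGHT * WIDTH)]
alpha_bars = [1.0, 0.9, 0.5, 0.1, 0.01]  # hypothetical schedule


def oracle_noise(x, t):
    """Stand-in for the trained network: the exact noise for a known target."""
    a_t = alpha_bars[t]
    return [(xi - math.sqrt(a_t) * ti) / math.sqrt(1.0 - a_t)
            for xi, ti in zip(x, target)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
result = ddim_sample(noise, oracle_noise, alpha_bars)
```

In the real model the oracle is replaced by the learned network conditioned on the text features, and the loop runs in the video latent space rather than on pixels.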

This model is for research purposes. Please refer to the [model limitations and biases](#model-limitations-and-biases) and [misuse, malicious use and excessive use](#misuse-malicious-use-and-excessive-use) sections.

Model Details

Property	Details
Developed by	ModelScope
Model Type	Diffusion-based text-to-video generation model
Language(s)	English
License	[CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
Resources for more information	ModelScope GitHub Repository, [Summary](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
Cite as	See Citation section

Use cases

This model can reason and generate videos based on arbitrary English text descriptions, with a wide range of applications.

View results

The code above returns the path where the output video is saved. The output mp4 file can be played with VLC media player; some other media players may not play it correctly.
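If the file has to play in other players, one option is to re-encode it. The sketch below assumes that `ffmpeg` is installed and built with the `libx264` encoder, and that H.264 with the `yuv420p` pixel format suits the target player; none of this comes from the model's own tooling, and the function names are hypothetical.

```python
import shutil
import subprocess


def build_reencode_command(src: str, dst: str) -> list:
    """Build an ffmpeg command that re-encodes `src` to H.264 / yuv420p."""
    return [
        "ffmpeg",
        "-y",                   # overwrite dst if it already exists
        "-i", src,              # input file
        "-c:v", "libx264",      # H.264 video encoder
        "-pix_fmt", "yuv420p",  # widely supported pixel format
        dst,
    ]


def reencode_for_compatibility(src: str, dst: str) -> str:
    """Run the re-encode and return the path of the new file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_reencode_command(src, dst), check=True)
    return dst
```

For example, `reencode_for_compatibility(video_path, "output_h264.mp4")` would write a second file next to the original, leaving the original untouched.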

Model limitations and biases

The model is trained on public datasets such as Webvid, so the generated results may show biases related to the distribution of the training data.
The model cannot produce film- or television-quality video.
The model cannot generate clear text.
The model is mainly trained on an English corpus and does not support other languages at the moment.
The model's performance on complex compositional generation tasks needs improvement.

Misuse, Malicious Use and Excessive Use

The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
Generating pornographic, violent, or gory content is prohibited.
Generating erroneous or false information is prohibited.

Training data

The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [Webvid](https://m-bain.github.io/webvid-dataset/) and other public datasets. After pre-training, images and videos are filtered by criteria such as aesthetic score, watermark score, and deduplication.
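The filtering step can be pictured as a simple pass over scored samples. The sketch below is illustrative only: the `Sample` fields, the score scales, and both thresholds are hypothetical, and the actual filtering pipeline is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    url: str
    content_hash: str       # hypothetical fingerprint used for deduplication
    aesthetic_score: float  # hypothetical scale, higher is better
    watermark_score: float  # hypothetical scale, higher means more likely watermarked


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score checks and are not duplicates."""
    seen_hashes = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue  # drop low aesthetic quality
        if sample.watermark_score > max_watermark:
            continue  # drop likely watermarked content
        if sample.content_hash in seen_hashes:
            continue  # drop duplicates of an already kept sample
        seen_hashes.add(sample.content_hash)
        kept.append(sample)
    return kept
```

The first sample with a given fingerprint is kept and later copies are dropped, so the result depends on input order.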

Citation

    @article{wang2023modelscope,
      title={Modelscope text-to-video technical report},
      author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
      journal={arXiv preprint arXiv:2308.06571},
      year={2023}
    }
    @InProceedings{VideoFusion,
        author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
        title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2023}
    }

⚠️ Important Note

This model is for research purposes. Please be aware of the model limitations and biases, and avoid malicious and excessive use.

💡 Usage Tip

Use VLC media player to view the output video. For long-video generation, enable attention slicing and VAE slicing and use Torch 2.0 to reduce memory usage.
