Text-to-video-synthesis Model in Open Domain
This is a diffusion-based text-to-video generation model that takes English text descriptions as input and generates corresponding videos.
Quick Start
Let's first install the required libraries:
$ pip install diffusers transformers accelerate
Now, generate a video:
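import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
# Load the half-precision (fp16) weights to reduce memory use
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
# Swap in the DPM-Solver multistep scheduler, reusing the existing scheduler config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Keep sub-models on the CPU and move each to the GPU only while it runs
pipe.enable_model_cpu_offload()
prompt = "Spiderman is surfing"
# Note: recent diffusers releases return batched frames; use `.frames[0]` there
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and get back its path
video_path = export_to_video(video_frames)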
Features
- Multi-stage Generation: The model consists of three sub-networks for text feature extraction, text feature-to-video latent space diffusion, and video latent space to video visual space transformation.
- English-only Support: Currently, it only supports English input for video generation.
- Wide Application: It can run inference and generate videos from arbitrary English text descriptions.
Installation
To use this model, you need to install the following libraries:
$ pip install diffusers transformers accelerate
Usage Examples
Basic Usage
Generate a video based on a simple text description:
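import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
# Load the half-precision (fp16) weights to reduce memory use
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
# Swap in the DPM-Solver multistep scheduler, reusing the existing scheduler config
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Keep sub-models on the CPU and move each to the GPU only while it runs
pipe.enable_model_cpu_offload()
prompt = "Spiderman is surfing"
# Note: recent diffusers releases return batched frames; use `.frames[0]` there
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and get back its path
video_path = export_to_video(video_frames)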
Advanced Usage
Optimize memory usage to generate longer videos:
$ pip install git+https://github.com/huggingface/diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
# Load the half-precision (fp16) weights to reduce memory use
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Memory optimizations: offload idle sub-models to the CPU and decode with the VAE in slices
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
# Request a longer clip via num_frames
# Note: recent diffusers releases return batched frames; use `.frames[0]` there
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
video_path = export_to_video(video_frames)
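The link between `num_frames` and clip length is plain arithmetic. The sketch below is only an illustration: `clip_duration_seconds` is a hypothetical helper, and the rate of 8 frames per second is an assumed value (the real rate is whatever is used when the frames are exported to a file).

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length of a clip in seconds.

    Hypothetical helper for illustration; not part of diffusers.
    """
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# 200 frames at an assumed 8 frames per second
print(clip_duration_seconds(200, 8))  # 25.0
```

Under that assumed frame rate, the `num_frames=200` request above corresponds to a 25-second clip.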
Documentation
Model Details
Use cases
This model can be used to generate videos based on arbitrary English text descriptions, with a wide range of applications.
Model limitations and biases
- The model is trained on public datasets such as WebVid, so generated results may reflect biases in the distribution of the training data.
- The model cannot generate film- or television-quality video.
- The model cannot generate clear text.
- The model is trained mainly on an English corpus and does not currently support other languages.
- The model's performance on complex compositional generation tasks still needs improvement.
Misuse, Malicious Use and Excessive Use
- The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
- It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
- Generating pornographic, violent, or gory content is prohibited.
- Generating erroneous or false information is prohibited.
Training data
The training data includes LAION5B, ImageNet, WebVid and other public datasets. After pre-training, images and videos are filtered using criteria such as aesthetic score, watermark score, and deduplication.
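The filtering step can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the `Sample` fields, the threshold values, and the hash-based deduplication are hypothetical, and the source does not describe how the scores are computed or which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; the real pipeline's fields are not documented here.
    content_hash: str       # identifier used to detect duplicates
    aesthetic_score: float  # higher is assumed to mean better-looking
    watermark_score: float  # higher is assumed to mean a likelier watermark


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds, dropping repeated hashes.

    The default thresholds are illustrative placeholders, not published values.
    """
    seen_hashes = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue  # fails the aesthetic filter
        if sample.watermark_score > max_watermark:
            continue  # probably watermarked
        if sample.content_hash in seen_hashes:
            continue  # duplicate of a sample already kept
        seen_hashes.add(sample.content_hash)
        kept.append(sample)
    return kept


# Made-up records, used only to exercise the function
demo = [
    Sample("a", 6.0, 0.1),
    Sample("a", 6.0, 0.1),  # duplicate
    Sample("b", 3.0, 0.1),  # low aesthetic score
    Sample("c", 7.0, 0.9),  # likely watermark
    Sample("d", 5.5, 0.2),
]
print([s.content_hash for s in filter_samples(demo)])  # ['a', 'd']
```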
Citation
@InProceedings{VideoFusion,
author = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
title = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023}
}
Technical Details
The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text feature-to-video latent space diffusion model, and a video latent space to video visual space model. The model has about 1.7 billion parameters in total. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process that starts from a pure Gaussian noise video.
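The iterative denoising idea can be illustrated with a toy, pure-Python sampler. This is a sketch under strong assumptions, not the model's implementation: the latent is a short list of floats rather than a video tensor, the `alpha_bar` noise schedule is hypothetical, the update is a deterministic DDIM-style step rather than the scheduler used in the usage examples, and `oracle_noise` is a hypothetical stand-in for the UNet3D that is handed the clean target so the loop can be checked end to end.

```python
import math
import random


def oracle_noise(x_t, clean, alpha_bar_t):
    """Stand-in for the noise-prediction network (hypothetical).

    Returns the noise that satisfies
    x_t = sqrt(alpha_bar_t) * clean + sqrt(1 - alpha_bar_t) * noise.
    """
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - math.sqrt(alpha_bar_t) * c) / scale for x, c in zip(x_t, clean)]


def ddim_sample(clean_target, num_steps=25, seed=0):
    """Denoise from Gaussian noise toward clean_target with DDIM-style steps."""
    rng = random.Random(seed)
    # Hypothetical schedule: alpha_bar[0] = 1 (clean), shrinking toward 0 (mostly noise)
    alpha_bar = [1.0 - t / (num_steps + 1) for t in range(num_steps + 1)]
    # Start from pure Gaussian noise
    x = [rng.gauss(0.0, 1.0) for _ in clean_target]
    for t in range(num_steps, 0, -1):
        eps = oracle_noise(x, clean_target, alpha_bar[t])
        # Estimate the clean latent implied by the predicted noise
        x0_pred = [
            (xi - math.sqrt(1.0 - alpha_bar[t]) * e) / math.sqrt(alpha_bar[t])
            for xi, e in zip(x, eps)
        ]
        # Move to the previous, less noisy timestep
        x = [
            math.sqrt(alpha_bar[t - 1]) * p + math.sqrt(1.0 - alpha_bar[t - 1]) * e
            for p, e in zip(x0_pred, eps)
        ]
    return x


target = [0.5, -1.0, 2.0, 0.0]
result = ddim_sample(target)
print(all(abs(r - t) < 1e-6 for r, t in zip(result, target)))
```

With a perfect noise predictor the loop recovers the target up to floating-point error. In the real model, a learned UNet3D conditioned on the text features makes that prediction, and a separate sub-network maps the resulting latent to video frames.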
License
This model is licensed under CC-BY-NC-ND.
Important Note
This model is meant for research purposes. Please see the "Model limitations and biases" and "Misuse, Malicious Use and Excessive Use" sections.
We Are Hiring! (Based in Beijing / Hangzhou, China.)
If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.
EMAIL: yingya.zyy@alibaba-inc.com