Text-to-video-ms-1.7b-legacy open-source model - Generate matching videos for free by entering English text

Text To Video Ms 1.7b Legacy

Developed by ali-vilab

A multi-stage text-to-video diffusion model that takes an English text description as input and generates a video matching that description

Text-to-Video #English text-to-video generation #Multi-stage diffusion model #Dynamic scene generation

Release Time : 3/22/2023

Model Overview

This model consists of a text feature extraction model, a text feature-to-video latent space diffusion model, and a video latent space-to-video visual space model. It uses a UNet3D structure and achieves video generation through iterative denoising

Model Features

Multi-stage generation architecture

Adopts a three-stage architecture of text feature extraction, latent space diffusion, and visual space conversion

Long video generation ability

Can generate videos up to 25 seconds long through memory optimization technology

High-quality video generation

Can generate coherent video content that matches the text description

Model Capabilities

Text-to-video generation

English text understanding

Dynamic scene generation

Use Cases

Creative content generation

Fictional scene generation

Generate videos based on imagined scenes, such as an astronaut riding a horse

Generate dynamic videos that match the description

Character action generation

Generate action videos for specific characters, such as Spider-Man surfing

Generate videos of the character performing the specified action

Educational demonstration

Concept visualization

Convert abstract concepts into visual videos

🚀 Text-to-video-synthesis Model in Open Domain

This model is based on a multi-stage text-to-video generation diffusion model. It takes a description text as input and returns a video matching the text description. Only English input is supported.

🚀 Quick Start

Let's first install the libraries required:

$ pip install git+https://github.com/huggingface/diffusers transformers accelerate

Now, generate a video:

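import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# Load the pipeline in half precision and switch to the DPM-Solver multistep scheduler
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b-legacy", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Offload idle submodels to the CPU to reduce GPU memory use
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
# Note: recent diffusers releases return a batch of videos; use .frames[0] there
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and get back its path
video_path = export_to_video(video_frames)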

✨ Features

Wide Application: This model has a wide range of applications, and can reason and generate videos based on arbitrary English text descriptions.
Multi-stage Architecture: The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text feature-to-video latent space diffusion model, and a video latent space-to-video visual space model.
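
The three sub-networks can be pictured as a chain of tensor shapes: text features, then a latent video, then a pixel video. The sketch below only traces those shapes and runs no model. The function `trace_shapes` is hypothetical, and every default in it (77 tokens, embedding width 1024, 4 latent channels, 8x spatial downscaling, 16 frames at 256x256) is an illustrative assumption rather than a value read from the checkpoint.

```python
from typing import NamedTuple, Tuple


class StageShapes(NamedTuple):
    text_features: Tuple[int, int]           # (tokens, embedding width)
    latent_video: Tuple[int, int, int, int]  # (channels, frames, height, width)
    pixel_video: Tuple[int, int, int, int]   # (frames, height, width, RGB)


def trace_shapes(num_frames: int = 16, height: int = 256, width: int = 256,
                 max_tokens: int = 77, embed_dim: int = 1024,
                 latent_channels: int = 4, downscale: int = 8) -> StageShapes:
    """Trace the data shape after each stage. All defaults are assumptions."""
    if height % downscale or width % downscale:
        raise ValueError("height and width must be multiples of the downscale factor")
    # Stage 1: the text encoder turns the prompt into one feature vector per token.
    text_features = (max_tokens, embed_dim)
    # Stage 2: diffusion runs in a spatially downscaled latent space, one latent per frame.
    latent_video = (latent_channels, num_frames, height // downscale, width // downscale)
    # Stage 3: the decoder maps each latent frame back to full-resolution RGB.
    pixel_video = (num_frames, height, width, 3)
    return StageShapes(text_features, latent_video, pixel_video)


shapes = trace_shapes()
print(shapes.latent_video)
```

Under these assumptions, the diffusion stage works on a tensor far smaller than the final video, which is the usual motivation for denoising in latent space.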

💻 Usage Examples

Advanced Usage

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# optimize for GPU memory
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# generate
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames

# convert to video
video_path = export_to_video(video_frames)
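
The relation between num_frames and clip length is simple arithmetic. The sketch below assumes an export rate of 8 frames per second, which is consistent with the 25-second figure quoted for long videos and the num_frames=200 used above. The actual rate depends on the fps setting of export_to_video in your diffusers version, and both helper names here are hypothetical.

```python
def clip_duration_seconds(num_frames: int, fps: int = 8) -> float:
    """Playback length in seconds for num_frames shown at fps (fps=8 is an assumption)."""
    if num_frames < 0 or fps <= 0:
        raise ValueError("num_frames must be >= 0 and fps must be > 0")
    return num_frames / fps


def frames_for_duration(seconds: float, fps: int = 8) -> int:
    """Number of frames to request for a clip of the given length."""
    if seconds < 0 or fps <= 0:
        raise ValueError("seconds must be >= 0 and fps must be > 0")
    return round(seconds * fps)


print(clip_duration_seconds(200))  # 25.0
```

Going the other way, frames_for_duration(25) gives the 200 frames passed as num_frames in the example above.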

📚 Documentation

Model description

The text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text feature-to-video latent space diffusion model, and a video latent space-to-video visual space model. The model has about 1.7 billion parameters in total. Currently, it only supports English input. The diffusion model adopts a UNet3D structure and generates a video by iteratively denoising a video of pure Gaussian noise.
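
The idea of iterative denoising can be illustrated with a toy loop. This is a minimal sketch, not the model's sampler: it uses a deterministic DDIM-style update on a flat list of floats standing in for a video latent, a made-up linear noise schedule, and an "oracle" noise predictor that replaces the UNet3D and knows the clean target in advance. All names and constants are assumptions chosen for illustration.

```python
import math
import random

NUM_STEPS = 25  # same count as num_inference_steps in the usage examples


def alpha_bar(t: int) -> float:
    """Share of clean signal at step t: 1.0 at t=0, 0.001 at t=NUM_STEPS (assumed schedule)."""
    return 1.0 - 0.999 * t / NUM_STEPS


def oracle_noise_prediction(x_t, target, t):
    """Stand-in for the UNet3D: solves for the noise exactly because it knows the target."""
    a = alpha_bar(t)
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(x_t, target)]


def denoise(target, seed: int = 0):
    """Start from Gaussian noise and apply NUM_STEPS deterministic DDIM-style updates."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure-noise starting point
    for t in range(NUM_STEPS, 0, -1):
        a_t, a_prev = alpha_bar(t), alpha_bar(t - 1)
        eps = oracle_noise_prediction(x, target, t)
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Re-noise that estimate to the lower noise level of the previous step.
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e for p, e in zip(x0_pred, eps)]
    return x


target = [0.5, -1.0, 0.25, 2.0]  # toy "clean latent"
result = denoise(target)
print(max(abs(r - t) for r, t in zip(result, target)))  # tiny floating-point error
```

In the real model the noise prediction comes from the UNet3D conditioned on the text features, so the result is a plausible sample rather than an exact reconstruction of a known target.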

This model is meant for research purposes. Please see the "Model limitations and biases" and "Misuse, Malicious Use and Excessive Use" sections.

Model Details

Property	Details
Developed by	ModelScope
Model Type	Diffusion-based text-to-video generation model
Language(s)	English
License	[CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)
Resources for more information	ModelScope GitHub Repository, [Summary](https://modelscope.cn/models/damo/text-to-video-synthesis/summary).

Use cases

This model has a wide range of applications, and can reason and generate videos based on arbitrary English text descriptions.

View results

The export_to_video call returns the path of the saved video (video_path in the code above). The output mp4 file plays in VLC media player; some other media players may not play it correctly.

Model limitations and biases

The model is trained on public datasets such as Webvid, and the generated results may be biased toward the distribution of the training data.
The model cannot generate video of film or television quality.
The model cannot generate clear text.
The model is mainly trained on an English corpus and does not currently support other languages.
The model's performance on complex compositional generation tasks needs improvement.

Misuse, Malicious Use and Excessive Use

The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
It is prohibited to generate pornographic, violent, or gory content.
It is prohibited to generate erroneous or false information.

Training data

The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [Webvid](https://m-bain.github.io/webvid-dataset/) and other public datasets. After pre-training, images and videos are filtered by criteria such as aesthetic score, watermark score, and deduplication.

(Part of this model card has been taken from [here](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis))
