Allegro-T2V-40x720P Open-source Text-to-Video Model - Freely Generate 2 to 6-second Detailed Videos with Multi-resolution Support

Allegro T2V 40x720P

Developed by rhymes-ai

Allegro is an open-source high-quality text-to-video generation model capable of producing detailed videos lasting 2 to 6 seconds at 15 FPS, supporting multiple resolutions.

Text-to-Video · English · Open Source License: Apache-2.0 · #High-quality video generation · #Long-sequence modeling · #Lightweight architecture

Release Time : 12/17/2024

Model Overview

Allegro is an advanced text-to-video generation model that creates high-quality video content based on textual prompts. It supports multiple resolutions (368x640 and 720x1280) and can be enhanced to 30 FPS through frame interpolation.

Model Features

Open-source

Complete model weights and code are open to the community under the Apache 2.0 license.

Diverse content creation

Capable of generating a wide range of content, from close-ups of humans and animals to various dynamic scenes.

High-quality output

Generates detailed videos lasting 2 to 6 seconds at 15 FPS with resolutions of 368x640 and 720x1280, which can be interpolated to 30 FPS.

Lightweight and efficient

Includes a VideoVAE with 175 million parameters and a VideoDiT model with 2.8 billion parameters. Supports multiple precisions, occupying only 9.3 GB of VRAM in BF16 mode with CPU offloading enabled.

Model Capabilities

Text-to-video generation

High-quality video synthesis

Diverse content creation

Video frame interpolation support

Use Cases

Creative content generation

Ad video generation

Generate high-quality promotional videos based on product descriptions.

Produces 2 to 6-second promotional videos suitable for social media marketing.

Animated short film creation

Generate animated short films based on storylines.

Creates richly detailed animated short films for creative projects.

Education

Educational video generation

Generate supplementary videos based on teaching materials.

Produces high-quality educational videos to enhance learning experiences.

🚀 Allegro - Text-to-Video Generation

Allegro is an open-source text-to-video generation model. It offers versatile content creation and high-quality output in a small, efficient package. The model weights and code are publicly available, enabling the community to explore and utilize its capabilities.

🚀 Quick Start

Step 1: Install the necessary requirements

Ensure Python >= 3.10, PyTorch >= 2.4, CUDA >= 12.4.
We recommend using Anaconda to create a new environment (Python >= 3.10) for the following example: conda create -n rllegro python=3.10 -y
Then run: pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4
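
Before running inference, it can help to confirm that the environment meets the minimums listed in Step 1. The following is a minimal sketch, not part of the Allegro codebase: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the only PyTorch attributes it reads are `torch.__version__` and `torch.version.cuda`.

```python
import importlib
import sys


def parse_version(text: str) -> tuple:
    """Turn a string such as '2.4.1+cu124' into (2, 4, 1), dropping build suffixes."""
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> dict:
    """Report whether Python, PyTorch, and CUDA satisfy the Step 1 minimums."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.4")
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda_version is not None and meets_minimum(cuda_version, "12.4")
    return report
```

Calling `check_environment()` returns a dictionary of booleans; any `False` entry points at the requirement to fix before moving on to Step 2.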

Step 2: Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x720P", subfolder="vae", torch_dtype=torch.float32)

pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x720P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Insert the normalized prompt into the quality-tag template (positive_prompt holds the {} slot).
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=15)

To reduce GPU memory use, call pipe.enable_sequential_cpu_offload() to offload the model to the CPU. Inference time will increase significantly.
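
The quick-start script wraps the user's prompt in a template of quality tags before passing it to the pipeline. The sketch below isolates that step so it can be checked without a GPU. The function name `build_prompt` and the constant `POSITIVE_TEMPLATE` are hypothetical; the template text is copied from the example above.

```python
POSITIVE_TEMPLATE = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""


def build_prompt(user_prompt: str, template: str = POSITIVE_TEMPLATE) -> str:
    """Lower-case and trim the user prompt, then place it in the template's {} slot."""
    return template.format(user_prompt.lower().strip())


if __name__ == "__main__":
    print(build_prompt("  A seaside harbor with bright sunlight and sparkling seawater.  "))
```

The returned string is what gets passed to the pipeline as the prompt; the negative prompt is passed separately and needs no templating.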

Step 3: (Optional) Interpolate the video to 30 FPS

It is recommended to use [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) to interpolate the video from 15 FPS to 30 FPS. For better visual quality, please use imageio to save the video.

Step 4: For faster inference

For faster inference options such as Context Parallel and PAB, see our [GitHub repo](https://github.com/rhymes-ai/Allegro).

✨ Features

Open Source: Full [model weights](https://huggingface.co/rhymes-ai/Allegro) and [code](https://github.com/rhymes-ai/Allegro) available to the community, under the Apache 2.0 license!
Versatile Content Creation: Capable of generating a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
High-Quality Output: Generate detailed 2 to 6-second videos at 15 FPS with 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS with [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI).
Small and Efficient: Features a 175M parameter VideoVAE and a 2.8B parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. Context length is 79.2K, equivalent to 88 frames.

📦 Model Info

Property	Details
Model	Allegro-T2V-40x720P
Description	Text-to-Video Generation Model
Download	[Hugging Face](https://huggingface.co/rhymes-ai/Allegro-T2V-40x720P)
Parameters (VAE)	175M
Parameters (DiT)	2.8B
Inference Precision (VAE)	FP32/TF32/BF16/FP16 (best in FP32/TF32)
Inference Precision (DiT/T5)	BF16/FP32/TF32
Context Length	36K
Resolution	720 x 1280
Frames	40
Video Length	3 seconds @ 15 FPS
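
The context-length figures on this page (36K for 40 frames, 79.2K for 88 frames) can be reproduced with a small calculation. The sketch below rests on assumptions this page does not state: that the VideoVAE compresses video 4x in time and 8x in each spatial dimension, and that the VideoDiT groups latents into 2x2 spatial patches. The function and parameter names are illustrative only.

```python
def context_length(frames: int, height: int, width: int,
                   temporal_stride: int = 4,   # assumed VAE temporal compression
                   spatial_stride: int = 8,    # assumed VAE spatial compression
                   patch: int = 2) -> int:     # assumed DiT spatial patch size
    """Estimate the number of tokens the DiT attends over for one video."""
    latent_frames = frames // temporal_stride
    tokens_per_frame = (height // (spatial_stride * patch)) * (width // (spatial_stride * patch))
    return latent_frames * tokens_per_frame


def video_seconds(frames: int, fps: int = 15) -> float:
    """Duration of a clip with the given frame count and frame rate."""
    return frames / fps


if __name__ == "__main__":
    for n in (40, 88):
        print(n, "frames:", context_length(n, 720, 1280), "tokens,",
              round(video_seconds(n), 2), "seconds at 15 FPS")
```

Under these assumptions, 40 frames at 720 x 1280 gives 36,000 tokens and 88 frames gives 79,200, matching the two figures quoted on this page. At 15 FPS, 40 frames last about 2.7 seconds, which the table rounds to 3.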

📚 Gallery

For more demos and corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).

📄 License

This repo is released under the Apache 2.0 License.
