Allegro-T2V-40x720P: Open-Source Text-to-Video Model - Free Generation of Detailed 2-to-6-Second Videos with Multi-Resolution Support

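Allegro T2V 40x720P

Developed by rhymes-ai

Allegro is an open-source, high-quality text-to-video generation model that produces detailed 2-to-6-second videos at 15 FPS and supports multiple resolutions.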

Text-to-Video | English | License: Apache-2.0 | #HD video generation #Long-sequence modeling #Lightweight architecture

Downloads: 21

Released: 12/17/2024

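Model Overview

Allegro is a text-to-video generation model that produces high-quality video from text prompts. It supports two resolutions (368x640 and 720x1280), and its 15 FPS output can be raised to 30 FPS with frame interpolation.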

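Model Features

Open source

The full model weights and code are available to the community under the Apache 2.0 license.

Versatile content creation

Generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes.

High-quality output

Generates detailed 2-to-6-second videos at 15 FPS, at 368x640 and 720x1280, which can be interpolated to 30 FPS.

Lightweight and efficient

Comprises a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Supports multiple precisions and uses only 9.3 GB of GPU memory in BF16 mode with CPU offloading enabled.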

Model Capabilities

Text-to-video generation

High-quality video synthesis

Versatile content creation

Video frame interpolation support

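Use Cases

Creative content generation

Advertising videos

Generate high-quality advertising videos from product descriptions.

Produces 2-to-6-second ad clips suitable for social media promotion.

Animated shorts

Generate animated shorts from a storyline.

Produces richly detailed animated shorts suitable for creative projects.

Education

Instructional videos

Generate supporting videos from teaching material.

Produces high-quality instructional videos that improve the learning experience.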

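🚀 Allegro - Text-to-Video Model

Allegro is an open-source text-to-video generation model that produces high-quality videos across a wide range of content. Its small parameter count and efficient inference make it a practical tool for video content creation.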

🚀 Quick Start

Install the dependencies

Make sure you have Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
We recommend creating a fresh Anaconda environment (Python >= 3.10) with conda create -n rllegro python=3.10 -y and running the examples below inside it.
Then run: pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4
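
Before running inference, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch using only the standard library; the helper names (parse_version, meets_minimum, check_environment) are hypothetical and not part of Allegro or diffusers. It checks Python and PyTorch only. The CUDA requirement is not checked here, since reading it would require importing torch itself.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.4.1+cu124' into (2, 4, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version string is at least the minimum tuple."""
    return parse_version(installed) >= minimum


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < (3, 10):
        problems.append("Python >= 3.10 is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch_version, (2, 4)):
            problems.append(f"PyTorch >= 2.4 is required, found {torch_version}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per unmet requirement and nothing when both checks pass.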

Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video

# Load the VAE in FP32.
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x720P", subfolder="vae", torch_dtype=torch.float32)

# Load the rest of the pipeline in BF16 and reuse the FP32 VAE.
pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x720P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
# Decode in tiles to lower peak VAE memory use.
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Insert the user prompt into the positive_prompt template.
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=15)

Call pipe.enable_sequential_cpu_offload() to offload the model to the CPU and reduce GPU memory use, at the cost of significantly longer inference time.
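
The offloading trade-off described above can be wrapped in a small loader. This is a minimal sketch rather than the authors' code: the function name load_allegro and its cpu_offload flag are hypothetical, while the diffusers calls are the same ones used in the inference example. It assumes a CUDA GPU and the dependencies from the install step. When offloading is enabled, the sketch skips pipe.to("cuda"), because offloading manages device placement itself.

```python
def load_allegro(repo_id="rhymes-ai/Allegro-T2V-40x720P", cpu_offload=False):
    """Build the Allegro pipeline, optionally with sequential CPU offload."""
    # Heavy imports stay inside the function so that importing this module
    # does not require the GPU stack to be installed.
    import torch
    from diffusers import AllegroPipeline, AutoencoderKLAllegro

    vae = AutoencoderKLAllegro.from_pretrained(
        repo_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipe = AllegroPipeline.from_pretrained(
        repo_id, vae=vae, torch_dtype=torch.bfloat16
    )
    if cpu_offload:
        # Lower GPU memory use, slower inference.
        pipe.enable_sequential_cpu_offload()
    else:
        # Keep every component resident on the GPU.
        pipe.to("cuda")
    pipe.vae.enable_tiling()
    return pipe
```

Calling load_allegro(cpu_offload=True) returns a pipeline that can be used exactly like pipe in the inference example.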

(Optional) Interpolate the video to 30 FPS

We recommend using EMA-VFI to interpolate the video from 15 FPS to 30 FPS. For better visual quality, save the video with imageio.
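
As a rough guide to how interpolation changes a clip, the sketch below does the frame-count bookkeeping. It assumes the interpolator keeps every original frame and inserts synthesized frames only between consecutive pairs, which is a common convention but is not confirmed here for EMA-VFI. The helper name interpolated_clip is hypothetical, and the code does not call EMA-VFI.

```python
def interpolated_clip(num_frames, fps, factor=2):
    """Return (frames, fps, seconds) after interpolating by an integer factor.

    Assumes original frames are kept and (factor - 1) new frames are
    inserted between each pair of consecutive frames.
    """
    if num_frames < 1 or factor < 1:
        raise ValueError("num_frames and factor must be positive")
    new_frames = (num_frames - 1) * factor + 1
    new_fps = fps * factor
    return new_frames, new_fps, new_frames / new_fps


if __name__ == "__main__":
    frames, fps, seconds = interpolated_clip(40, 15)
    print(f"{frames} frames at {fps} FPS, {seconds:.2f} s")
```

Under that assumption, a 40-frame clip at 15 FPS becomes 79 frames at 30 FPS, so the playback duration stays close to the original.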

Faster inference

For faster inference methods such as Context Parallel and PAB, see our GitHub repository.

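✨ Key Features

Open source: The full model weights and code are available to the community under the Apache 2.0 license.
Versatile content creation: Generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes.
High-quality output: Generates detailed 2-to-6-second videos at 15 FPS, at 368x640 and 720x1280, which can be interpolated to 30 FPS with EMA-VFI.
Small and efficient: A 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2K, equivalent to 88 frames.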

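📦 Model Information

Attribute	Details
Model name	Allegro-T2V-40x720P
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (FP32/TF32 recommended); DiT/T5: BF16/FP32/TF32
Context length	36K
Resolution	720 x 1280
Frames	40
Video length	3 seconds @ 15 FPS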
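
The two context-length figures quoted on this page (36K for 40 frames, 79.2K for 88 frames) can be reproduced with a short calculation. The sketch below assumes a VAE that compresses video 4x in time and 8x in each spatial dimension, followed by 2x2 spatial patching in the DiT. These factors are assumptions chosen because they match both quoted figures; they are not stated on this page, and the function name dit_context_length is hypothetical.

```python
def dit_context_length(frames, height, width, vae_t=4, vae_s=8, patch=2):
    """Count DiT tokens for a clip under the assumed compression factors."""
    latent_t = frames // vae_t   # temporal compression by the VAE (assumed 4x)
    latent_h = height // vae_s   # spatial compression by the VAE (assumed 8x)
    latent_w = width // vae_s
    # Each patch x patch block of latents becomes one token (assumed 2x2).
    return latent_t * (latent_h // patch) * (latent_w // patch)


if __name__ == "__main__":
    print(dit_context_length(40, 720, 1280))  # 40-frame variant
    print(dit_context_length(88, 720, 1280))  # 88-frame variant
    print(round(40 / 15, 2), "seconds at 15 FPS")
```

With these factors, 40 frames at 720 x 1280 gives 36,000 tokens and 88 frames gives 79,200. The 40-frame clip lasts about 2.67 seconds at 15 FPS, which the table rounds to 3 seconds.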