Allegro-T2V-40x360P open-source text-to-video model - generate diverse, high-quality dynamic videos for free

Allegro T2V 40x360P

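Developed by rhymes-ai

Allegro is an open-source text-to-video model that generates high-quality videos of diverse dynamic scenes.

Text-to-video | English | License: Apache-2.0 | #HD-video-generation #lightweight-architecture #dynamic-scene-modeling

Downloads: 21

Released: 12/17/2024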

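Model overview

Allegro is an advanced text-to-video generation model that produces high-quality video from a text description. It covers a wide range of creative needs, from close-ups of people and animals to diverse dynamic scenes.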

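Model features

Open source

Full model weights and code are available under the Apache 2.0 license

Versatile content creation

Generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes

High-definition output

Generates detailed 2- to 6-second videos at 15 FPS, at 368x640 and 720x1280 resolution

Lightweight and efficient

Consists of a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT, with support for inference at multiple precisions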

Model capabilities

Text-to-video generation

High-quality video generation

Diverse scene creation

High-definition video output

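Use cases

Creative content generation

Dynamic scene creation

Generate videos of various dynamic scenes from text descriptions

Produces high-quality videos 2 to 6 seconds long

Close-up video generation

Generate close-up videos of people or animals

Produces finely detailed close-ups of people or animals

Film and video production support

Concept video previews

Quickly generate video previews of film concepts

Helps production teams visualize creative ideas quickly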

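🚀 Allegro - Text-to-Video Generation Model

Allegro is an open-source text-to-video generation model that produces high-quality video from input text. It combines broad content-creation ability, an efficient model architecture, and high output quality.

🚀 Quick start

Install the required dependencies

Make sure Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4 are available.
Using Anaconda to create a new environment (Python >= 3.10) for the example below is recommended: conda create -n rllegro python=3.10 -y
Install the packages: pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4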
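
Before running inference, it can help to confirm that the environment meets the minimums listed above. The following is a minimal sketch and not part of the Allegro repository: `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical helper names. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which is only a proxy for the installed toolkit or driver.

```python
import re
import sys

# Minimum versions quoted in the requirements above.
MIN_PYTHON = "3.10"
MIN_TORCH = "2.4"
MIN_CUDA = "12.4"


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.4.1+cu124' into (2, 4, 1), dropping any local suffix."""
    core = version.split("+")[0]
    return tuple(int(part) for part in re.findall(r"\d+", core))


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare numerically, so that '2.10' correctly counts as newer than '2.4'."""
    return version_tuple(version) >= version_tuple(minimum)


def check_environment() -> dict:
    """Report which requirements hold; torch is imported lazily so this runs without it."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, MIN_PYTHON)}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, MIN_TORCH)
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda_version is not None and meets_minimum(cuda_version, MIN_CUDA)
    return report
```

Calling `check_environment()` returns a dictionary with one boolean per requirement, so any `False` entry points at the component to upgrade.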

Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x360P", subfolder="vae", torch_dtype=torch.float32)
vae.tile_overlap_t = 8
vae.tile_overlap_h = 144
vae.tile_overlap_w = 64
vae.stride = (16,112,192)

pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x360P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Insert the user prompt into the positive-prompt template.
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(prompt, negative_prompt=negative_prompt, guidance_scale=7.5, max_sequence_length=512, num_inference_steps=100, generator = torch.Generator(device="cuda:0").manual_seed(42)).frames[0]
export_to_video(video, "output.mp4", fps=15)

Calling pipe.enable_sequential_cpu_offload() offloads the model to the CPU to reduce GPU memory consumption, at the cost of significantly longer inference time.
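
The trade-off between speed and memory can be wrapped in a small helper. This is a sketch under assumptions: `place_pipeline` and its `low_vram` flag are hypothetical names introduced here, while `to`, `enable_sequential_cpu_offload`, and `vae.enable_tiling` are the diffusers methods already used in this section. When offloading is enabled there should be no need to move the pipeline to the GPU first, because diffusers manages device placement itself.

```python
def place_pipeline(pipe, low_vram: bool = False, device: str = "cuda"):
    """Choose between full GPU placement and sequential CPU offload.

    low_vram=True trades speed for memory: submodules are moved to the GPU
    only while they run. Otherwise the whole pipeline is kept on the GPU.
    """
    if low_vram:
        # pipe.to(device) is skipped in this branch; offloading handles placement.
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.to(device)
    # VAE tiling is enabled in both modes, as in the inference example.
    pipe.vae.enable_tiling()
    return pipe
```

With the pipeline from the inference example, `place_pipeline(pipe, low_vram=True)` would take the place of the `pipe.to("cuda")` and `pipe.vae.enable_tiling()` calls.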

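(Optional) Interpolate the video to 30 FPS

Using EMA-VFI to interpolate the video from 15 FPS to 30 FPS is recommended. For better visual quality, save the video with imageio.

Faster inference

For faster inference methods such as Context Parallel and PAB, refer to the Allegro GitHub repository.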

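✨ Key features

Open source: full model weights and code are available to the community under the Apache 2.0 license.
Versatile content creation: generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes.
High-quality output: generates detailed 2- to 6-second videos at 15 FPS, at 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Compact and efficient: a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Multiple precisions are supported (FP32, BF16, FP16), and GPU memory usage is 9.3 GB in BF16 mode with CPU offloading. The 79.2K context length corresponds to 88 frames; the figures for the 40x360P variant are listed in the table below.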

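📦 Model information

Attribute	Details
Model	Allegro-T2V-40x360P
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (FP32/TF32 is best); DiT/T5: BF16/FP32/TF32
Context length	9.2K
Resolution	368 x 640
Frames	40
Video length	About 3 seconds @ 15 FPS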
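
The context lengths quoted on this page can be reproduced with simple arithmetic. The sketch below assumes that the VideoVAE compresses video by 4x in time and 8x8 in space, and that the VideoDiT groups latents into 2x2 spatial patches. These factors are not stated on this page and are an assumption here, although they do yield both the 9.2K and the 79.2K figures. `context_length` and `duration_seconds` are hypothetical helper names.

```python
# Assumed compression factors (not stated in this model card).
VAE_T, VAE_H, VAE_W = 4, 8, 8  # VideoVAE temporal / spatial downsampling
PATCH_H, PATCH_W = 2, 2        # VideoDiT spatial patch size


def context_length(frames: int, height: int, width: int) -> int:
    """Number of DiT tokens for a clip, under the assumed factors above."""
    latent_frames = frames // VAE_T
    tokens_h = height // (VAE_H * PATCH_H)
    tokens_w = width // (VAE_W * PATCH_W)
    return latent_frames * tokens_h * tokens_w


def duration_seconds(frames: int, fps: int = 15) -> float:
    """Clip length in seconds at the given frame rate."""
    return frames / fps


print(context_length(40, 368, 640))    # 10 * 23 * 40 = 9200
print(context_length(88, 720, 1280))   # 22 * 45 * 80 = 79200
print(round(duration_seconds(40), 2))  # 2.67
```

The 40-frame clip therefore lasts roughly 2.7 seconds at 15 FPS, which the table rounds to about 3 seconds.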