Allegro Open-Source Text-to-Video Model: Free Generation of Detailed 6-Second Videos at 720x1280

Allegro

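Developed by rhymes-ai

Allegro is an open-source, high-quality text-to-video generation model that produces detailed 6-second videos at 720x1280 resolution and 15 FPS.

Text-to-video · English · License: Apache-2.0 · #high-definition-video-generation #lightweight-model #multi-scene-adaptation

Downloads: 250

Released: 10/16/2024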

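Model Overview

Allegro is a text-to-video generation model built on the Diffusers library. It generates diverse, high-quality video content and is suited to creative content production.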

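Model Features

Open source

The full model weights and code are available to the community under the Apache 2.0 license.

Diverse content creation

Generates a wide range of content, from close-ups of people and animals to many kinds of dynamic scenes.

High-quality output

Generates detailed 6-second videos at 720x1280 resolution and 15 FPS, which can be interpolated to 30 FPS with EMA-VFI.

Lightweight and efficient

Combines a 175M-parameter VideoVAE with a 2.8B-parameter VideoDiT and supports multiple precisions (FP32/BF16/FP16).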

Model Capabilities

Text-to-video generation

High-resolution video generation

Diverse scene creation

Dynamic content generation

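Use Cases

Creative content creation

Advertising video generation

Generate creative advertising videos from text descriptions.

High-quality, emotionally expressive video content

Social media content

Generate engaging short videos for social media platforms.

Diverse, high-resolution videos

Education

Instructional video generation

Generate supporting videos from teaching material.

Clear, vivid teaching material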

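🚀 Allegro: Text-to-Video Generation Model

Allegro is an open-source text-to-video generation model that produces high-quality video across a wide range of content. It has a small parameter count, runs efficiently, and supports several inference precisions, which makes it a practical tool for video creation.

Gallery · GitHub · Blog · Paper · Discord · Join the waitlist (try it on Discord!)

🖼️ Gallery

For more demos and their prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).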

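✨ Key Features

Open source: the full model weights and code are available to the community under the Apache 2.0 license.
Diverse content: generates a wide range of content, from close-ups of people and animals to many kinds of dynamic scenes.
High-quality output: generates detailed 6-second videos at 15 FPS and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Small and efficient: uses a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Supports multiple precisions (FP32, BF16, FP16) and needs only 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2K, which corresponds to 88 frames.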
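
The 79.2K context length can be reproduced with a short calculation. The sketch below is illustrative only: the compression factors (4x temporal and 8x8 spatial for the VideoVAE, then 2x2 spatial patches in the VideoDiT) are assumptions that are not stated on this page; they are chosen because they yield the stated figure for 88 frames at 720x1280.

```python
# Assumed factors (not stated on this page): the VideoVAE compresses 4x in time
# and 8x8 in space, and the VideoDiT groups latents into 2x2 spatial patches.
VAE_TEMPORAL = 4
VAE_SPATIAL = 8
DIT_PATCH = 2


def context_tokens(frames, height, width):
    """Number of tokens the DiT would attend over under the assumed factors."""
    latent_frames = frames // VAE_TEMPORAL
    tokens_h = height // (VAE_SPATIAL * DIT_PATCH)
    tokens_w = width // (VAE_SPATIAL * DIT_PATCH)
    return latent_frames * tokens_h * tokens_w


# 22 latent frames * 45 * 80 patches per frame
print(context_tokens(88, 720, 1280))  # 79200
```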

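ℹ️ Model Info

Attribute	Details
Model name	Allegro
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (best with FP32/TF32); DiT/T5: BF16/FP32/TF32
Context length	79.2K
Resolution	720 x 1280
Frames	88
Video length	6 seconds @ 15 FPS
Single-GPU memory usage	9.3 GB in BF16 (with CPU offload)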
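
As a quick sanity check on the table, 88 frames at 15 FPS comes to slightly under the rounded 6 seconds. The sketch below also shows what doubling the frame rate would imply if one frame were inserted between each neighbouring pair. That midpoint scheme is an assumption for illustration; the frame count EMA-VFI actually emits is not specified here.

```python
def clip_duration(frames, fps):
    """Playback length in seconds."""
    return frames / fps


def midpoint_interpolated_frames(frames):
    """Frame count after inserting one new frame between each adjacent pair
    (illustrative assumption, not a description of EMA-VFI itself)."""
    return 2 * frames - 1


native = clip_duration(88, 15)                 # about 5.87 s
doubled = midpoint_interpolated_frames(88)     # 175 frames
print(round(native, 2), doubled, round(clip_duration(doubled, 30), 2))
```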

🚀 Quick Start

1. Install the required dependencies

Make sure you have Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
We recommend creating a fresh Anaconda environment (Python >= 3.10) for the examples below: conda create -n rllegro python=3.10 -y
Then run: pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4

2. Run inference
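
Before running inference, it can help to confirm that the environment meets the stated minimums. The following is a minimal sketch, not part of the Allegro project; the helper names `parse_version`, `meets`, and `check_environment` are hypothetical. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the installed driver.

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '2.4.1+cu124' into (2, 4, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets(found, required):
    """True if the found version is at least the required one."""
    return parse_version(found) >= parse_version(required)


def check_environment():
    """Report which of the stated minimums are satisfied."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets(torch.__version__, "2.4")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and meets(cuda, "12.4")
    return report


if __name__ == "__main__":
    print(check_environment())
```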

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video

# Load the VAE in FP32 and the rest of the pipeline in BF16.
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
# Decode the video in tiles to reduce peak VAE memory.
pipe.vae.enable_tiling()
prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

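# Insert the scene description into the positive template (the template, not the raw prompt, holds the {} slot).
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(prompt, negative_prompt=negative_prompt, guidance_scale=7.5, max_sequence_length=512, num_inference_steps=100, generator = torch.Generator(device="cuda:0").manual_seed(42)).frames[0]
# Save at the model's native 15 FPS.
export_to_video(video, "output.mp4", fps=15)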
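
For GPUs with less memory, the script above can be adapted to use CPU offloading, which is the configuration the 9.3 GB figure refers to. The following is a sketch under assumptions: it uses the generic Diffusers method `enable_sequential_cpu_offload()` in place of `pipe.to("cuda")`, and this page does not confirm that this is the exact offload mode behind the reported number. The helper names `build_prompt` and `load_pipeline` are hypothetical, and offloading generally trades speed for memory.

```python
POSITIVE_TEMPLATE = (
    "(masterpiece), (best quality), (ultra-detailed), (unwatermarked), "
    "{} "
    "emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, "
    "sharp focus, high budget, cinemascope, moody, epic, gorgeous"
)


def build_prompt(description, template=POSITIVE_TEMPLATE):
    """Fill the {} slot of the positive template with the scene description."""
    return template.format(description.lower().strip())


def load_pipeline(cpu_offload=True):
    """Load Allegro; heavy imports are kept local so the helpers above stay lightweight."""
    import torch
    from diffusers import AutoencoderKLAllegro, AllegroPipeline

    vae = AutoencoderKLAllegro.from_pretrained(
        "rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32
    )
    pipe = AllegroPipeline.from_pretrained(
        "rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16
    )
    if cpu_offload:
        # Assumption: sequential offload, used instead of moving the pipeline to the GPU.
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.to("cuda")
    pipe.vae.enable_tiling()
    return pipe
```

Calling `load_pipeline()` and passing `build_prompt(...)` to the returned pipeline mirrors the inference call shown earlier.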