Allegro-T2V-40x360P: Open-Source Text-to-Video Model for Free Generation of Diverse, High-Quality Dynamic Videos

Allegro T2V 40x360P

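Developed by rhymes-ai

Allegro is an open-source text-to-video model that generates high-quality, diverse videos of dynamic scenes.

Task: text-to-video | Language: English | License: Apache-2.0 | Tags: #HD-video-generation #lightweight-architecture #dynamic-scene-modeling

Downloads: 21

Released: 12/17/2024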

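Model Overview

Allegro is an advanced text-to-video generation model that produces high-quality video from a text description. It covers a wide range of creative needs, from close-ups of humans and animals to diverse dynamic scenes.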

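Model Features

Open source

Full model weights and code are available under the Apache 2.0 license

Versatile content creation

Generates a wide range of content, from close-ups of humans and animals to diverse dynamic scenes

High-definition output

Generates detailed 2 to 6 second videos at 15 FPS, at 368x640 and 720x1280 resolution

Lightweight and efficient

Combines a 175M-parameter VideoVAE with a 2.8B-parameter VideoDiT model and supports inference at multiple precisions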

Model Capabilities

Text-to-video generation

High-quality video generation

Diverse scene creation

High-definition video output

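Use Cases

Creative content generation

Dynamic scene creation

Generate videos of various dynamic scenes from text descriptions

Produces 2 to 6 second high-quality videos

Close-up video generation

Generate close-up videos of humans or animals

Produces detailed close-ups of people or animals

Film and TV production support

Concept video previews

Quickly generate video previews of film and TV concepts

Helps production teams visualize ideas quickly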

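🚀 Allegro - Text-to-Video Generation Model

Allegro is an open-source text-to-video generation model that produces high-quality video from input text. It combines broad content-creation ability, an efficient model architecture, and strong output quality.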

🚀 Quick Start

Install the required dependencies

Make sure you have Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
We recommend creating a fresh Anaconda environment (Python >= 3.10) for the examples below: conda create -n allegro python=3.10 -y
Install the packages: pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4
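
As a quick sanity check of the version requirements above, the following sketch compares installed versions against the stated minimums. The helper names (parse_version, meets_minimum, check_environment) are hypothetical and not part of Allegro or diffusers. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which is only an approximation of what the driver supports.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '2.4.1+cu124' into a tuple like (2, 4, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare versions numerically, so that '2.10' counts as newer than '2.4'."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Return a dict of pass/fail flags for the minimums listed in this guide."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.10")}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.4")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and meets_minimum(cuda, "12.4")
    return report


if __name__ == "__main__":
    print(check_environment())
```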

Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video

# Load the VAE in FP32 and set the tiling parameters given for this checkpoint.
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x360P", subfolder="vae", torch_dtype=torch.float32)
vae.tile_overlap_t = 8
vae.tile_overlap_h = 144
vae.tile_overlap_w = 64
vae.stride = (16,112,192)

# Load the rest of the pipeline in BF16, move it to the GPU, and enable VAE tiling.
pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x360P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Insert the user prompt into the positive template (the string that contains "{}").
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=15)
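
The prompt-templating step in the example can be isolated into a small helper, sketched below. The names POSITIVE_TEMPLATE and build_prompt are hypothetical; the template text is copied from the example. One point worth noting: str.format only has an effect when it is called on the string that contains the {} placeholder, so the template, not the raw user prompt, has to be the object being formatted.

```python
# Template copied from the inference example; "{}" marks where the user prompt goes.
POSITIVE_TEMPLATE = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked),
{}
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo,
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""


def build_prompt(user_prompt, template=POSITIVE_TEMPLATE):
    """Lowercase and strip the user prompt, then place it inside the template."""
    return template.format(user_prompt.lower().strip())


if __name__ == "__main__":
    print(build_prompt("  A seaside harbor with bright sunlight.  "))
    # Formatting a string that has no placeholder leaves it unchanged.
    print("no placeholder here".format("ignored"))
```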

Calling pipe.enable_sequential_cpu_offload() offloads the model to the CPU, which reduces GPU memory usage but significantly increases inference time.
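
A minimal sketch of how the two placement modes might be switched is shown below. The helper place_pipeline and its low_vram flag are hypothetical. The sketch assumes the usual diffusers convention that sequential CPU offload is used instead of pipe.to("cuda"), not in addition to it, since offloading manages device placement itself.

```python
def place_pipeline(pipe, low_vram=False):
    """Choose between full GPU placement and sequential CPU offload.

    `pipe` is expected to be a diffusers pipeline such as the AllegroPipeline
    loaded in the inference example.
    """
    if low_vram:
        # Lower GPU memory use, noticeably slower inference.
        pipe.enable_sequential_cpu_offload()
    else:
        # Fastest option when the whole model fits in GPU memory.
        pipe.to("cuda")
    # VAE tiling is enabled in both modes, as in the inference example.
    pipe.vae.enable_tiling()
    return pipe
```

With the pipeline from the example, this would be called as place_pipeline(pipe, low_vram=True) in place of the pipe.to("cuda") and pipe.vae.enable_tiling() lines.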

(Optional) Interpolate the video to 30 FPS

We recommend using EMA-VFI to interpolate the video from 15 FPS to 30 FPS. For better visual quality, save the video with imageio.
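
Saving with imageio could look like the sketch below. It assumes the frames have already been interpolated (the EMA-VFI step itself is not shown) and are uint8 arrays of shape height x width x 3. The helper save_video, its injectable writer argument, and the quality value of 8 are illustrative choices, not settings taken from the Allegro documentation. The default writer is imageio.mimwrite, which needs the imageio-ffmpeg package for MP4 output.

```python
def save_video(frames, path, fps=30, quality=8, writer=None):
    """Write frames to a video file and return the clip duration in seconds.

    `writer` defaults to imageio.mimwrite; passing another callable with the
    same signature makes the function usable without imageio installed.
    """
    if writer is None:
        import imageio  # imported lazily; requires imageio and imageio-ffmpeg
        writer = imageio.mimwrite
    frames = list(frames)
    if not frames:
        raise ValueError("no frames to save")
    writer(path, frames, fps=fps, quality=quality)
    return len(frames) / fps
```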

Faster inference

For faster inference methods such as Context Parallel and PAB, see our GitHub repository.

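✨ Key Features

Open source: full model weights and code are available to the community under the Apache 2.0 license!
Versatile content creation: generates a wide range of content, from close-ups of humans and animals to diverse dynamic scenes.
High-quality output: generates detailed 2 to 6 second videos at 15 FPS, at 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Small and efficient: a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT model. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2K, equivalent to 88 frames.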
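
The context-length figure can be reproduced with a short calculation, sketched below. The compression factors are assumptions not stated on this page: a VAE that downsamples by 4 in time and 8 in each spatial dimension, followed by 1x2x2 patchification in the DiT. They are used here because they reproduce both the 79.2K figure for 88 frames at 720x1280 and the 9.2K figure listed for 40 frames at 368x640.

```python
def context_length(frames, height, width, vae_stride=(4, 8, 8), patch_size=(1, 2, 2)):
    """Number of tokens after VAE compression and patchification.

    vae_stride and patch_size are (time, height, width) factors and are
    assumptions for illustration, not values confirmed by this page.
    """
    tokens = 1
    for size, stride, patch in zip((frames, height, width), vae_stride, patch_size):
        tokens *= size // (stride * patch)
    return tokens


if __name__ == "__main__":
    print(context_length(88, 720, 1280))  # 79200
    print(context_length(40, 368, 640))   # 9200
```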

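📦 Model Information

Attribute	Details
Model	Allegro-T2V-40x360P
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (FP32/TF32 works best); DiT/T5: BF16/FP32/TF32
Context length	9.2K
Resolution	368 x 640
Frames	40
Video length	About 3 seconds @ 15 FPS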
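
The video lengths quoted on this page follow directly from frame count and frame rate, as the sketch below shows. The function names are hypothetical. The interpolated frame count assumes that doubling the frame rate inserts one new frame between each pair of neighbouring frames; the actual EMA-VFI output may differ.

```python
def duration_seconds(num_frames, fps):
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


def frames_after_interpolation(num_frames, factor=2):
    """Frame count after inserting (factor - 1) frames between each neighbouring pair."""
    return (num_frames - 1) * factor + 1


if __name__ == "__main__":
    print(round(duration_seconds(40, 15), 2))  # 2.67, the "about 3 seconds" in the table
    print(round(duration_seconds(88, 15), 2))  # 5.87, the upper end of "2 to 6 seconds"
    print(frames_after_interpolation(40))      # 79 frames for playback at 30 FPS
```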