Allegro-T2V-40x720P: open-source text-to-video model that generates detailed 2 to 6 second videos for free, with multi-resolution support

Allegro T2V 40x720P

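Developed by rhymes-ai

Allegro is an open-source, high-quality text-to-video generation model that produces detailed 2 to 6 second videos at 15 FPS and supports multiple resolutions.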

Text-to-video | English | License: Apache-2.0 | #HD video generation #long-sequence modeling #lightweight architecture

Downloads: 21

Released: 12/17/2024

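Model overview

Allegro is an advanced text-to-video generation model that produces high-quality video from a text prompt. It supports two resolutions (368x640 and 720x1280), and its 15 FPS output can be raised to 30 FPS with frame interpolation.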

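Model features

Open source

The full model weights and code are available to the community under the Apache 2.0 license.

Versatile content creation

Generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes.

High-quality output

Generates detailed 2 to 6 second videos at 15 FPS in 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS.

Small and efficient

Combines a 175M-parameter VideoVAE with a 2.8B-parameter VideoDiT. Supports multiple precisions and uses only 9.3 GB of GPU memory in BF16 mode with CPU offloading enabled.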

Model capabilities

Text-to-video generation

High-quality video synthesis

Versatile content creation

Video frame interpolation support

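Use cases

Creative content generation

Advertising video generation

Generate high-quality advertising videos from product descriptions.

Produces 2 to 6 second ad clips suitable for social media promotion.

Animated short creation

Generate animated shorts from a storyline.

Produces richly detailed animated shorts suitable for creative projects.

Education

Teaching video generation

Generate supporting videos from teaching material.

Produces high-quality teaching videos that improve the learning experience.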

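🚀 Allegro - Text-to-Video Model

Allegro is an open-source text-to-video generation model that produces high-quality video across a wide range of content. Its small parameter count and high efficiency make it a strong foundation for video content creation.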

🚀 Quick start

Install the required dependencies

Make sure you have Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
We recommend using Anaconda to create a new environment (Python >= 3.10) with conda create -n rllegro python=3.10 -y and then running the example below inside that environment.
Install the packages with pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4
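
The version requirements above can be checked before downloading any weights. The following is a minimal sketch, not part of the Allegro codebase: parse_version, meets_minimum and check_environment are hypothetical helper names, and the CUDA check reads torch.version.cuda, which reports the CUDA version PyTorch was built against rather than the installed driver.

```python
import re
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Build suffixes such as "+cu124" are ignored, so "2.4.1+cu124" gives (2, 4, 1).
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple[int, ...]) -> bool:
    """True if `version` is at least `minimum` (tuple comparison)."""
    return parse_version(version) >= minimum


def check_environment() -> dict[str, bool]:
    """Compare the running environment with the minimums quoted above."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, (2, 4))
    # torch.version.cuda is None on CPU-only builds of PyTorch.
    cuda = torch.version.cuda
    report["cuda"] = cuda is not None and meets_minimum(cuda, (12, 4))
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'below the stated minimum or missing'}")
```

Running the script inside the new conda environment prints one line per requirement, which makes a mismatched PyTorch or CUDA build visible before the first inference run.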

Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x720P", subfolder="vae", torch_dtype=torch.float32)

pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x720P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Insert the user prompt into the {} slot of the positive template.
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=15)
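
The template step is easy to get wrong, because str.format only substitutes into the string it is called on. As a minimal sketch, the wrapping of a user prompt in the positive template can be isolated in a small helper. build_prompt and POSITIVE_TEMPLATE are hypothetical names introduced here for illustration and are not part of diffusers.

```python
# The quality tags from the inference example, with one {} slot for the user prompt.
POSITIVE_TEMPLATE = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked),
{}
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo,
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""


def build_prompt(user_prompt: str, template: str = POSITIVE_TEMPLATE) -> str:
    """Lower-case and trim the user prompt, then place it in the template's {} slot."""
    return template.format(user_prompt.lower().strip())


if __name__ == "__main__":
    print(build_prompt("  A seaside harbor with bright sunlight  "))
    # Calling .format on a string without a {} slot returns it unchanged,
    # so the template, not the raw prompt, must be the string being formatted.
    print("no slot here".format("ignored"))
```

The returned string is what would be passed to the pipeline as the prompt, with the negative prompt supplied separately.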

Calling pipe.enable_sequential_cpu_offload() offloads the model to the CPU to reduce GPU memory usage, at the cost of significantly longer inference time.
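
A minimal sketch of switching between the two loading modes is shown below. load_allegro_pipeline and its low_vram flag are hypothetical names introduced for illustration; the diffusers calls are the ones used in the inference example. The sketch assumes that, when offloading is enabled, the pipeline should not also be moved to the GPU with pipe.to("cuda"), since the offload hooks handle device placement.

```python
def load_allegro_pipeline(low_vram: bool = False):
    """Load Allegro-T2V-40x720P, optionally with sequential CPU offload."""
    # Imported inside the function so that defining it needs no GPU stack.
    import torch
    from diffusers import AllegroPipeline, AutoencoderKLAllegro

    model_id = "rhymes-ai/Allegro-T2V-40x720P"
    # The VAE is kept in FP32, as in the inference example.
    vae = AutoencoderKLAllegro.from_pretrained(
        model_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipe = AllegroPipeline.from_pretrained(
        model_id, vae=vae, torch_dtype=torch.bfloat16
    )
    if low_vram:
        # Submodules are moved to the GPU only while they run (slower, less memory).
        pipe.enable_sequential_cpu_offload()
    else:
        # Whole pipeline resident on the GPU (faster, more memory).
        pipe.to("cuda")
    pipe.vae.enable_tiling()
    return pipe
```

Calling load_allegro_pipeline(low_vram=True) trades speed for memory; the default keeps the behaviour of the inference example.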

(Optional) Interpolate the video to 30 FPS

We recommend EMA-VFI for interpolating the video from 15 FPS to 30 FPS. For better visual quality, save the video with imageio.
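
The sketch below illustrates the bookkeeping around this step; it does not run EMA-VFI itself. It assumes a 2x interpolator that inserts one new frame between each neighbouring pair, and that the interpolated frames are available as HxWx3 uint8 arrays. interpolated_frame_count and save_video are hypothetical helper names; the writer call uses the imageio v2 API with the imageio-ffmpeg backend installed earlier.

```python
def interpolated_frame_count(num_frames: int, factor: int = 2) -> int:
    """Frames after inserting (factor - 1) new frames between each neighbouring pair."""
    if num_frames < 1 or factor < 1:
        raise ValueError("num_frames and factor must both be at least 1")
    return (num_frames - 1) * factor + 1


def save_video(frames, path: str, fps: int = 30) -> None:
    """Write a sequence of HxWx3 uint8 arrays to `path` with imageio."""
    import imageio.v2 as imageio  # .mp4 output needs the imageio-ffmpeg backend

    with imageio.get_writer(path, fps=fps) as writer:
        for frame in frames:
            writer.append_data(frame)


if __name__ == "__main__":
    # Under the 2x assumption, a 40-frame clip becomes 79 frames.
    print(interpolated_frame_count(40))
```

Because the frame rate doubles along with the frame count, the interpolated clip plays for roughly the same time as the original.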

Faster inference

For faster inference methods such as Context Parallel and PAB, see our GitHub repository.

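✨ Key features

Open source: the full model weights and code are available to the community under the Apache 2.0 license.
Versatile content creation: generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes.
High-quality output: generates detailed 2 to 6 second videos at 15 FPS in 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Small and efficient: a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The context length is 79.2K, equivalent to 88 frames; the model information table lists 36K for this 40-frame checkpoint.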
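
As a rough check on the parameter counts above, the storage needed for the weights alone follows from parameters times bytes per parameter. This is a back-of-the-envelope sketch with hypothetical helper names; it ignores the text encoder, activations and framework overhead, so it does not reproduce the 9.3 GB figure quoted above.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_gb(num_params: float, dtype: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    vae = weight_gb(175e6, "fp32")  # VideoVAE kept in FP32
    dit = weight_gb(2.8e9, "bf16")  # VideoDiT in BF16
    print(f"VideoVAE weights: {vae:.2f} GB")
    print(f"VideoDiT weights: {dit:.2f} GB")
```

The DiT dominates the weight footprint, which is consistent with running it in BF16 while keeping the much smaller VAE in FP32.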

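📦 Model information

Attribute	Details
Model name	Allegro-T2V-40x720P
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (FP32/TF32 best); DiT/T5: BF16/FP32/TF32
Context length	36K
Resolution	720 x 1280
Frames	40
Video length	3 seconds @ 15 FPS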
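
The figures in the table can be related with simple arithmetic. The sketch below assumes that "K" means 1,000 and that the context length counts tokens; clip_seconds and tokens_per_frame are hypothetical helper names.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback time of a clip in seconds."""
    return num_frames / fps


def tokens_per_frame(context_length: int, num_frames: int) -> float:
    """Average context length spent on each video frame."""
    return context_length / num_frames


if __name__ == "__main__":
    # 40 frames at 15 FPS is about 2.67 s, which the table rounds to 3 seconds.
    print(f"{clip_seconds(40, 15):.2f} s")
    # Both quoted context lengths give the same average per frame.
    print(tokens_per_frame(36_000, 40))  # this checkpoint: 36K over 40 frames
    print(tokens_per_frame(79_200, 88))  # key-features figure: 79.2K over 88 frames
```

Both context lengths work out to the same per-frame average, so the 36K and 79.2K figures are consistent with each other and differ only in the number of frames.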