Allegro-T2V-40x360P: an open-source text-to-video generation model

Allegro T2V 40x360P

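Developed by rhymes-ai

Allegro is an open-source text-to-video model that generates high-quality videos of diverse dynamic scenes.

Text-to-video | English | Open-source license: Apache-2.0 | #high-definition video generation #lightweight architecture #dynamic scene modeling

Downloads: 21

Release date: 12/17/2024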

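Model overview

Allegro is a text-to-video model that generates high-quality video from a text description. It covers a wide range of creative needs, from close-ups of people and animals to diverse dynamic scenes.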

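Model features

Open source

The full model weights and code are released under the Apache 2.0 license

Diverse content

Generates a wide range of content, from close-ups of people and animals to diverse dynamic scenes

High-definition output

The Allegro models generate high-quality 2 to 6 second videos at 15 FPS in 368x640 and 720x1280 resolution; this 40x360P variant outputs 40 frames at 368x640

Lightweight and efficient

Comprises a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT, and supports inference at multiple precisions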

Model capabilities

Text-to-video generation

High-quality video generation

Diverse scene creation

High-definition video output

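Use cases

Creative content generation

Dynamic scene creation

Generates videos of various dynamic scenes from text descriptions

Produces high-quality videos 2 to 6 seconds long

Close-up video generation

Generates close-up videos of people or animals

Produces detailed close-ups of people or animals

Film and video production support

Concept video previews

Quickly generates video previews of a visual concept

Helps production teams visualize ideas quickly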

🚀 Allegro

Allegro is an open-source model that generates high-quality video from text. It can produce a wide variety of content while keeping a small, efficient parameter count.

🚀 Quick start

Install the requirements

Make sure you have Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4.
We recommend using Anaconda to create a new environment (Python >= 3.10): run conda create -n rllegro python=3.10 -y before running the examples below.
Then run pip install git+https://github.com/huggingface/diffusers.git torch==2.4.1 transformers==4.40.1 accelerate sentencepiece imageio imageio-ffmpeg beautifulsoup4 to install the dependencies.
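Before downloading the weights, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of the Allegro repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and only `torch.__version__` and `torch.version.cuda` are real PyTorch attributes.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '2.4.1+cu124' into (2, 4, 1)."""
    # Drop local/build suffixes ('+cu124', 'a0', '.dev...') before parsing.
    core = re.split(r"[+\-a-zA-Z]", text, maxsplit=1)[0]
    return tuple(int(part) for part in core.strip(".").split(".") if part)


def meets_minimum(found, required):
    """Compare versions numerically, so that '3.10' counts as newer than '3.9'."""
    f, r = parse_version(found), parse_version(required)
    width = max(len(f), len(r))
    f += (0,) * (width - len(f))
    r += (0,) * (width - len(r))
    return f >= r


def check_environment():
    """Return a dict of True/False per requirement (None if it cannot be checked)."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.10")}
    try:
        import torch
    except ImportError:
        report["torch"] = None
        report["cuda"] = None
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.4")
    cuda = torch.version.cuda  # None on CPU-only builds of PyTorch
    report["cuda"] = meets_minimum(cuda, "12.4") if cuda else None
    return report


if __name__ == "__main__":
    print(check_environment())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which is only a proxy for the toolkit and driver installed on the machine.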

Run inference

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video

# Load the VAE in FP32, the precision the model information table rates best for the VAE.
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro-T2V-40x360P", subfolder="vae", torch_dtype=torch.float32)
# Tiling parameters for this checkpoint, as given in the source.
vae.tile_overlap_t = 8
vae.tile_overlap_h = 144
vae.tile_overlap_w = 64
vae.stride = (16, 112, 192)

# Load the rest of the pipeline (DiT and text encoder) in BF16 and reuse the FP32 VAE.
pipe = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro-T2V-40x360P", vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()

prompt = "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this location might be a popular spot for docking fishing boats."

# Quality tags wrapped around the user prompt; the {} placeholder receives the prompt text.
positive_prompt = """
(masterpiece), (best quality), (ultra-detailed), (unwatermarked), 
{} 
emotional, harmonious, vignette, 4k epic detailed, shot on kodak, 35mm photo, 
sharp focus, high budget, cinemascope, moody, epic, gorgeous
"""

negative_prompt = """
nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, 
low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry.
"""

# Fill the template. Calling format() on the plain prompt (which has no placeholder) would leave it unchanged.
prompt = positive_prompt.format(prompt.lower().strip())

video = pipe(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),  # fixed seed for reproducibility
).frames[0]
export_to_video(video, "output.mp4", fps=15)

To reduce GPU memory usage, call pipe.enable_sequential_cpu_offload() instead of pipe.to("cuda") to offload the model to the CPU. Inference time increases significantly.
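The choice between full GPU placement and CPU offloading can be wrapped in one helper so that only one of the two calls is ever made. This is a sketch under assumptions: `place_pipeline` and its `low_vram` flag are hypothetical names introduced here, while `to`, `enable_sequential_cpu_offload`, and `vae.enable_tiling` are the pipeline methods already used in this section.

```python
def place_pipeline(pipe, low_vram=False, device="cuda"):
    """Put a loaded pipeline on the GPU, or offload it to the CPU.

    low_vram=True trades speed for memory: submodules are moved to the GPU
    only while they run. The two placement calls are treated as alternatives
    and are never combined.
    """
    if low_vram:
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.to(device)
    # VAE tiling is enabled in both modes, as in the example above.
    pipe.vae.enable_tiling()
    return pipe


# Usage with the pipeline from the inference example (commented out because
# it needs the downloaded weights and a CUDA device):
# pipe = AllegroPipeline.from_pretrained(
#     "rhymes-ai/Allegro-T2V-40x360P", vae=vae, torch_dtype=torch.bfloat16
# )
# pipe = place_pipeline(pipe, low_vram=True)
```

With `low_vram=True` the rest of the inference code is unchanged; expect noticeably longer generation times.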

(Optional) Interpolate the video to 30 FPS

We recommend using EMA-VFI to interpolate the video from 15 FPS to 30 FPS. For better visual quality, save the video with imageio.
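The sketch below illustrates only the mechanics of going from 15 FPS to 30 FPS and of saving with imageio. It does not use EMA-VFI: the hypothetical helper `double_frame_rate` inserts a plain average of each neighboring frame pair, a much cruder stand-in for a learned interpolator. It assumes the frames are a uint8 array of shape (T, H, W, 3); `save_video` is also a hypothetical wrapper, and it assumes the imageio-ffmpeg backend from the install step is available.

```python
import numpy as np


def double_frame_rate(frames):
    """Insert one blended frame between each pair of neighbors.

    NOTE: linear blending is a placeholder for illustration, not EMA-VFI.
    T input frames become 2*T - 1 output frames.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if len(frames) < 2:
        return frames.astype(np.uint8)
    mids = (frames[:-1] + frames[1:]) / 2.0
    out = np.empty((2 * len(frames) - 1, *frames.shape[1:]), dtype=np.float32)
    out[0::2] = frames  # original frames at even positions
    out[1::2] = mids    # blended frames in between
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def save_video(frames, path, fps=30):
    """Write frames to a video file with imageio (imported lazily)."""
    import imageio

    imageio.mimwrite(path, list(frames), fps=fps)


# Usage with the pipeline output (commented out; `video` comes from the
# inference example and is converted to an array first):
# frames = np.stack([np.asarray(f) for f in video])
# save_video(double_frame_rate(frames), "output_30fps.mp4", fps=30)
```

For a 40-frame clip this yields 79 frames, which at 30 FPS keeps the duration close to the original.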

Faster inference

For faster inference options such as Context Parallel and PAB, see the GitHub repository.

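✨ Key features

Open source: the model weights and code are fully available to the community under the Apache 2.0 license.
Diverse content creation: generates a wide range of content, from close-ups of people and animals to a variety of dynamic scenes.
High-quality output: generates detailed 2 to 6 second videos at 15 FPS in 368x640 and 720x1280 resolution, which can be interpolated to 30 FPS with EMA-VFI.
Small and efficient: a 175M-parameter VideoVAE and a 2.8B-parameter VideoDiT. Supports multiple precisions (FP32, BF16, FP16) and uses 9.3 GB of GPU memory in BF16 mode with CPU offloading. The quoted context length of 79.2K corresponds to 88 frames; for this 40-frame variant, the model information table lists 9.2K.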

📦 Installation

See "Install the requirements" under "Quick start" above.

📚 Documentation

Gallery

For more demos and their corresponding prompts, see the [Allegro Gallery](https://rhymes.ai/allegro_gallery).

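Model information

Attribute	Details
Model name	Allegro-T2V-40x360P
Description	Text-to-video generation model
Download	Hugging Face
Parameters	VAE: 175M; DiT: 2.8B
Inference precision	VAE: FP32/TF32/BF16/FP16 (FP32/TF32 works best); DiT/T5: BF16/FP32/TF32
Context length	9.2K
Resolution	368 x 640
Frame count	40
Video length	About 3 seconds at 15 FPS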
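
The context length, resolution, and frame count in the table are related by simple arithmetic. The sketch below assumes a 4x temporal reduction and a 16x reduction along each spatial axis between pixels and tokens. These factors are not stated on this page; they are an assumption that happens to reproduce both the 9.2K figure for this variant and the 79.2K figure quoted for 88 frames at 720x1280. The function names are hypothetical.

```python
def context_length(frames, height, width, t_factor=4, s_factor=16):
    """Approximate token count for one clip.

    t_factor and s_factor are ASSUMED reduction factors (temporal and
    per-axis spatial); they are not taken from this page.
    """
    return (frames // t_factor) * (height // s_factor) * (width // s_factor)


def clip_seconds(frames, fps=15):
    """Clip duration in seconds at the given frame rate."""
    return frames / fps


if __name__ == "__main__":
    print(context_length(40, 368, 640))    # 9200, the 9.2K listed for this variant
    print(context_length(88, 720, 1280))   # 79200, the 79.2K quoted for 88 frames
    print(round(clip_seconds(40), 2))      # 2.67, i.e. about 3 seconds
```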