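ConsisID-preview: an open-source text-to-video model that keeps a person's identity consistent through frequency decomposition

ConsisID Preview

Developed by BestWishYsh

A text-to-video model that uses frequency decomposition to preserve identity, keeping a person's identity consistent throughout the generated video.

Text-to-video | English | Open-source license: Apache-2.0 | #identity-preserving-video-generation #frequency-decomposition #high-resolution-video

Downloads: 322

Release date: 11/26/2024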

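Model overview

ConsisID is a text-to-video model fine-tuned from THUDM/CogVideoX-5b and THUDM/CogVideoX1.5-5B-I2V. It focuses on keeping a person's identity consistent across the frames of a generated video. The model uses frequency decomposition to improve how well facial features are preserved, which makes it suitable for video generation scenarios that require a high-fidelity identity.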

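Model features

Identity preservation

Uses frequency decomposition to keep a person's facial features consistent throughout video generation

High-quality video generation

Generates 6-second videos at 720x480 resolution and 8 FPS

Prompt refinement support

Responds well to long, detailed prompts, and guidance on refining prompts is provided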

Model capabilities

Text-to-video generation

Facial feature preservation

Dynamic scene generation

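Use cases

Film and video production

Character scene generation

Generates consistent video scenes for a specific character

Video sequences in which the character's facial features stay consistent

Advertising creative

Brand spokesperson generation

Generates consistent videos of a brand spokesperson across different scenes

Brand promotion videos with a consistent identity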

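🚀 [CVPR 2025] Identity-Preserving Text-to-Video Generation by Frequency Decomposition

ConsisID is an identity-preserving text-to-video generation model based on frequency decomposition. It can be deployed with the Hugging Face diffusers library.

🤗 Huggingface Space | 📄 Page | 🌐 Github | 📜 arxiv | 🐳 Dataset

If you like this project, please give it a star ⭐ on GitHub to receive the latest updates.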

📋 Basic information

Property	Details
Base model	THUDM/CogVideoX-5b, THUDM/CogVideoX1.5-5B-I2V
Dataset	BestWishYsh/ConsisID-preview-Data
Language	en
Library name	diffusers
License	apache-2.0
Pipeline tag	text-to-video
Tags	IPT2V
Base model relation	finetune

😍 Gallery

Identity-preserving text-to-video generation. (Some good prompts can be found here.) You can also click here to watch the video.

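🚀 Quick start

This model supports deployment with the Hugging Face diffusers library. Follow the steps below to deploy it.

⚠️ Important note

For a better experience, we recommend visiting the GitHub repository and reviewing the related prompt refinement and conversion steps.

1. Install the required dependencies

# ConsisID will be merged into diffusers in the next release. For now, install diffusers from source.
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg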
pip install git+https://github.com/huggingface/diffusers.git
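After installing, it can help to confirm that every distribution from the pip commands above is actually present in the active environment. The following is a minimal sketch using only the standard library; `missing_distributions` and `REQUIRED` are hypothetical names introduced here for illustration, and the list simply mirrors the package names in the commands above.

```python
from importlib import metadata

# Distribution names copied from the pip commands in this section.
REQUIRED = [
    "consisid_eva_clip",
    "pyfacer",
    "insightface",
    "facexlib",
    "transformers",
    "accelerate",
    "imageio-ffmpeg",
    "diffusers",
]


def missing_distributions(names):
    """Return the distribution names that are not installed here."""
    missing = []
    for name in names:
        try:
            # Raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_distributions(REQUIRED)` returns an empty list when everything is installed. The check only looks at installed metadata, so it does not prove that the installed diffusers build already contains `ConsisIDPipeline`.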

2. Run the code

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID works well with long, detailed prompts. Make sure the face in the image is clearly visible (for example, a half-body or full-body shot is preferable).
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)
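The export call above writes the clip at 8 FPS. As a rough check on clip length, the sketch below divides a frame count by the frame rate. The 49-frame figure is the one quoted in the GPU memory section of this card, and `clip_seconds` is a hypothetical helper, not part of diffusers.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length in seconds for a clip of num_frames at fps."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# 49 frames is the count quoted in the GPU memory section; 8 is the fps used above.
duration = clip_seconds(49, 8)
```

With these inputs the result is 6.125 seconds, which matches the roughly 6-second clips described for the model.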

🛠️ Prompt refiner

ConsisID is demanding about prompt quality. You can use GPT-4o to refine the input text prompt. An example follows (original prompt: "a man is playing guitar.")

a man is playing guitar.

Change the sentence above to something like this (add some facial changes, even if they are minor. Don't make the sentence too long): 

The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind him. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.

Some sample prompts are available here.
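The refinement request shown above has three parts: the original prompt, the instruction sentence, and a sample description to imitate. A minimal sketch of assembling that text is given below. `build_refine_request` is a hypothetical helper introduced for illustration, and sending the resulting string to GPT-4o is deliberately left out because the client and API used for that are not specified here.

```python
# Instruction sentence copied from the example in this section.
INSTRUCTION = (
    "Change the sentence above to something like this "
    "(add some facial changes, even if they are minor. "
    "Don't make the sentence too long):"
)


def build_refine_request(prompt: str, example: str) -> str:
    """Join the original prompt, the instruction, and a sample description."""
    prompt = prompt.strip()
    example = example.strip()
    if not prompt:
        raise ValueError("prompt must not be empty")
    if not example:
        raise ValueError("example must not be empty")
    # Blank lines separate the three parts, as in the example text.
    return f"{prompt}\n\n{INSTRUCTION}\n\n{example}"


request_text = build_refine_request(
    "a man is playing guitar.",
    "The video features a man standing next to an airplane...",
)
```

The example description is shortened here for space; in practice the full sample paragraph from this section would be passed as `example`.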

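💡 GPU memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H). This rules out consumer GPUs and the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To reproduce these numbers, you can refer to this script.

Feature (overlaid on the previous)	Max memory allocated	Max memory reserved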
-	37 GB	44 GB
enable_model_cpu_offload	22 GB	25 GB
enable_sequential_cpu_offload	16 GB	22 GB
vae.enable_slicing	16 GB	22 GB
vae.enable_tiling	5 GB	7 GB

# Enable these if you do not have multiple GPUs or a GPU with enough memory (such as an H100).
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

⚠️ Important note

These options make inference slower and may also reduce output quality.
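Because these options trade speed and possibly quality for memory, it can be worth enabling only as many as the available GPU memory requires. The sketch below picks the shortest prefix of the table that fits a given budget. It assumes each table row is applied on top of the rows above it, and it uses the "max memory reserved" column as the threshold. `features_needed` is a hypothetical helper, and real memory use may differ from the table on other hardware or library versions.

```python
# Peak reserved memory in GB without any optimization, from the table.
BASELINE_GB = 44

# (feature, max reserved GB once this and all earlier rows are enabled).
TABLE = [
    ("enable_model_cpu_offload", 25),
    ("enable_sequential_cpu_offload", 22),
    ("vae.enable_slicing", 22),
    ("vae.enable_tiling", 7),
]


def features_needed(available_gb: float) -> list:
    """Return the table features to enable, in order, to fit available_gb."""
    if available_gb >= BASELINE_GB:
        return []  # Enough memory to run without any optimization.
    enabled = []
    for feature, reserved_gb in TABLE:
        enabled.append(feature)
        if reserved_gb <= available_gb:
            return enabled
    raise ValueError("no combination in the table fits the given budget")
```

For example, a 24 GB budget yields the two CPU offload options, while a 16 GB budget yields all four. The returned names correspond to the `pipe.` calls shown above.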

🙌 Description

Repository: Code, Page, Data
Paper: https://huggingface.co/papers/2411.17440
Contact: Shenghai Yuan

✏️ Citation

If you find this paper and code useful for your research, please consider giving the repository a star and citing the paper.

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}