ConsisID-preview開源文本到視頻模型 - 頻率分解保持人物身份一致性生成視頻

Consisid Preview

由BestWishYsh開發

通過頻率分解實現身份保持的文本到視頻生成模型，能夠在生成視頻時保持人物身份的一致性。

文本生成視頻英語開源協議:Apache-2.0 #身份保持視頻生成 #頻率分解技術 #高分辨率視頻

下載量 322

發布時間 : 11/26/2024

模型概述

ConsisID是一個基於THUDM/CogVideoX-5b和THUDM/CogVideoX1.5-5B-I2V微調的文本到視頻生成模型，專注於在視頻生成過程中保持人物身份的連續性。該模型通過頻率分解技術優化了面部特徵的保持能力，適用於需要高保真人物身份的視頻生成場景。

模型特點

身份保持

通過先進的頻率分解技術，在視頻生成過程中保持人物面部特徵的連續性

高質量視頻生成

能夠生成720x480分辨率、8FPS的6秒視頻

提示優化支持

對長且描述詳細的提示有良好響應，提供提示優化建議

模型能力

文本到視頻生成

面部特徵保持

動態場景生成

使用案例

影視製作

角色場景生成

為特定角色生成連貫的視頻場景

保持角色面部特徵一致的視頻序列

廣告創意

品牌代言人生成

生成品牌代言人在不同場景下的連貫視頻

身份一致的品牌宣傳視頻

🚀 [CVPR 2025] 通過頻率分解實現身份保留的文本到視頻生成

本項目通過頻率分解技術實現身份保留的文本到視頻生成，為文本到視頻生成領域提供了新的解決方案，具有較高的應用價值。

🚀 快速開始

本模型支持使用huggingface的diffusers庫進行部署，你可以按照以下步驟進行部署：

我們建議你訪問我們的GitHub，查看相關的提示詞優化和轉換方法，以獲得更好的體驗。

安裝所需依賴

# ConsisID將在下個版本合併到diffusers中，所以目前你需要從源代碼安裝。
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg 
pip install git+https://github.com/huggingface/diffusers.git

運行代碼

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID在長且描述詳細的提示詞下表現良好。確保圖像中的人臉清晰可見（例如，最好是半身或全身照）。
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)

✨ 主要特性

身份保留的文本到視頻生成：通過頻率分解技術，在文本到視頻生成過程中保留人物身份特徵。
支持huggingface diffusers庫部署：方便用戶進行模型部署和使用。
對提示詞優化有指導：提供了使用GPT - 4o優化提示詞的方法和示例。
GPU內存優化策略：針對GPU內存不足的情況，提供了多種內存優化方法。

📦 安裝指南

安裝所需依賴：

# ConsisID將在下個版本合併到diffusers中，所以目前你需要從源代碼安裝。
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg 
pip install git+https://github.com/huggingface/diffusers.git

💻 使用示例

基礎用法

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID在長且描述詳細的提示詞下表現良好。確保圖像中的人臉清晰可見（例如，最好是半身或全身照）。
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)

📚 詳細文檔

🛠️ 提示詞優化器

ConsisID對提示詞質量有較高要求，你可以使用GPT - 4o來優化輸入的文本提示詞，示例如下（原始提示詞："a man is playing guitar."）

a man is playing guitar.

將上述句子修改為類似以下的內容（添加一些面部變化，即使很細微。不要使句子過長）： 

The video features a man standing next to an airplane, engaged in a conversation on his cell phone. he is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind his. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.

部分示例提示詞可在此處查看。

💡 GPU內存優化

ConsisID解碼49幀（8 FPS下6秒的視頻），輸出分辨率為720x480（寬x高）時，大約需要44GB的GPU內存，這使得它無法在消費級GPU或免費的T4 Colab上運行。可以使用以下內存優化方法來減少內存佔用。如需復現，可參考此腳本。

特性（疊加前一個）	最大分配內存	最大保留內存
-	37 GB	44 GB
enable_model_cpu_offload	22 GB	25 GB
enable_sequential_cpu_offload	16 GB	22 GB
vae.enable_slicing	16 GB	22 GB
vae.enable_tiling	5 GB	7 GB

# 如果你沒有多個GPU或足夠的GPU內存（如H100），請開啟以下選項
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

⚠️ 重要提示：這會增加推理時間，並且可能會降低質量。

🙌 項目信息

倉庫地址：代碼，項目頁面，數據
論文地址：https://huggingface.co/papers/2411.17440
聯繫人：袁深海

🔧 技術細節

本項目基於以下基礎模型：

屬性	詳情
模型類型	THUDM/CogVideoX - 5b、THUDM/CogVideoX1.5 - 5B - I2V
訓練數據	BestWishYsh/ConsisID - preview - Data

📄 許可證

本項目採用apache - 2.0許可證。

✏️ 引用

如果你覺得我們的論文和代碼在你的研究中有用，請考慮給我們一個Star並引用：

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}