ConsisID-preview开源文本到视频模型 - 频率分解保持人物身份一致性生成视频

Consisid Preview

由 BestWishYsh 开发

通过频率分解实现身份保持的文本到视频生成模型，能够在生成视频时保持人物身份的一致性。

文本生成视频英语开源协议:Apache-2.0 #身份保持视频生成 #频率分解技术 #高分辨率视频

下载量 322

发布时间 : 11/26/2024

模型简介

ConsisID是一个基于THUDM/CogVideoX-5b和THUDM/CogVideoX1.5-5B-I2V微调的文本到视频生成模型，专注于在视频生成过程中保持人物身份的连续性。该模型通过频率分解技术优化了面部特征的保持能力，适用于需要高保真人物身份的视频生成场景。

模型特点

身份保持

通过先进的频率分解技术，在视频生成过程中保持人物面部特征的连续性

高质量视频生成

能够生成720x480分辨率、8FPS的6秒视频

提示优化支持

对长且描述详细的提示有良好响应，提供提示优化建议

模型能力

文本到视频生成

面部特征保持

动态场景生成

使用案例

影视制作

角色场景生成

为特定角色生成连贯的视频场景

保持角色面部特征一致的视频序列

广告创意

品牌代言人生成

生成品牌代言人在不同场景下的连贯视频

身份一致的品牌宣传视频

🚀 [CVPR 2025] 通过频率分解实现身份保留的文本到视频生成

本项目通过频率分解技术实现身份保留的文本到视频生成，为文本到视频生成领域提供了新的解决方案，具有较高的应用价值。

🚀 快速开始

本模型支持使用huggingface的diffusers库进行部署，你可以按照以下步骤进行部署：

我们建议你访问我们的GitHub，查看相关的提示词优化和转换方法，以获得更好的体验。

安装所需依赖

# ConsisID将在下个版本合并到diffusers中，所以目前你需要从源代码安装。
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg 
pip install git+https://github.com/huggingface/diffusers.git

运行代码

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID在长且描述详细的提示词下表现良好。确保图像中的人脸清晰可见（例如，最好是半身或全身照）。
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)

✨ 主要特性

身份保留的文本到视频生成：通过频率分解技术，在文本到视频生成过程中保留人物身份特征。
支持huggingface diffusers库部署：方便用户进行模型部署和使用。
对提示词优化有指导：提供了使用GPT - 4o优化提示词的方法和示例。
GPU内存优化策略：针对GPU内存不足的情况，提供了多种内存优化方法。

📦 安装指南

安装所需依赖：

# ConsisID将在下个版本合并到diffusers中，所以目前你需要从源代码安装。
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg 
pip install git+https://github.com/huggingface/diffusers.git

💻 使用示例

基础用法

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID在长且描述详细的提示词下表现良好。确保图像中的人脸清晰可见（例如，最好是半身或全身照）。
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)

📚 详细文档

🛠️ 提示词优化器

ConsisID对提示词质量有较高要求，你可以使用GPT - 4o来优化输入的文本提示词，示例如下（原始提示词："a man is playing guitar."）

a man is playing guitar.

将上述句子修改为类似以下的内容（添加一些面部变化，即使很细微。不要使句子过长）： 

The video features a man standing next to an airplane, engaged in a conversation on his cell phone. he is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind his. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.

部分示例提示词可在此处查看。

💡 GPU内存优化

ConsisID解码49帧（8 FPS下6秒的视频），输出分辨率为720x480（宽x高）时，大约需要44GB的GPU内存，这使得它无法在消费级GPU或免费的T4 Colab上运行。可以使用以下内存优化方法来减少内存占用。如需复现，可参考此脚本。

特性（叠加前一个）	最大分配内存	最大保留内存
-	37 GB	44 GB
enable_model_cpu_offload	22 GB	25 GB
enable_sequential_cpu_offload	16 GB	22 GB
vae.enable_slicing	16 GB	22 GB
vae.enable_tiling	5 GB	7 GB

# 如果你没有多个GPU或足够的GPU内存（如H100），请开启以下选项
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

⚠️ 重要提示：这会增加推理时间，并且可能会降低质量。

🙌 项目信息

仓库地址：代码，项目页面，数据
论文地址：https://huggingface.co/papers/2411.17440
联系人：袁深海

🔧 技术细节

本项目基于以下基础模型：

属性	详情
模型类型	THUDM/CogVideoX - 5b、THUDM/CogVideoX1.5 - 5B - I2V
训练数据	BestWishYsh/ConsisID - preview - Data

📄 许可证

本项目采用apache - 2.0许可证。

✏️ 引用

如果你觉得我们的论文和代码在你的研究中有用，请考虑给我们一个Star并引用：

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}