ConsisID Preview

Developed by BestWishYsh

A text-to-video generation model that maintains identity consistency through frequency decomposition.

Text-to-Video | English | Open Source License: Apache-2.0 | #Identity-preserving video generation #Frequency decomposition technique #High-resolution video

Downloads 322

Release Time: 11/26/2024

Model Overview

ConsisID is a text-to-video generation model fine-tuned from THUDM/CogVideoX-5b and THUDM/CogVideoX1.5-5B-I2V. It focuses on keeping a character's identity consistent throughout a generated video. The model improves facial feature preservation through frequency decomposition, which makes it suitable for high-fidelity, identity-preserving video generation.

Model Features

Identity preservation

Maintains continuity of facial features during video generation through advanced frequency decomposition technology

High-quality video generation

Capable of generating 6-second videos at 720x480 resolution and 8 FPS

Prompt optimization support

Responds well to long, detailed prompts; the project provides prompt optimization suggestions

Model Capabilities

Text-to-video generation

Facial feature preservation

Dynamic scene generation

Use Cases

Film production

Character scene generation

Generating coherent video scenes for specific characters

Video sequences with consistent character facial features

Advertising creativity

Brand spokesperson generation

Generating coherent videos of brand spokespersons in different scenarios

Brand promotional videos with consistent identity

base_model:
- THUDM/CogVideoX-5b
- THUDM/CogVideoX1.5-5B-I2V
datasets:
- BestWishYsh/ConsisID-preview-Data
language:
- en
library_name: diffusers
license: apache-2.0
pipeline_tag: text-to-video
tags:
- IPT2V
base_model_relation: finetune

[CVPR 2025] Identity-Preserving Text-to-Video Generation by Frequency Decomposition

🤗 Huggingface Space | 📄 Page | 🌐 Github | 📜 arxiv | 🐳 Dataset

If you like our project, please give us a star ⭐ on GitHub for the latest update.

😍 Gallery

Identity-Preserving Text-to-Video Generation. (Some best prompts here) or you can click here to watch the video.

🤗 Quick Start

This model can be deployed with the Hugging Face diffusers library by following these steps.

We recommend that you visit our GitHub and check out the relevant prompt optimizations and conversions to get a better experience.

Install the required dependencies

# ConsisID will be merged into diffusers in the next version. So for now, you should install from source.
pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg 
pip install git+https://github.com/huggingface/diffusers.git
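After installing, a quick sanity check can confirm that the installed diffusers build exposes the `ConsisIDPipeline` class used in the next step. This is a minimal sketch: the helper name `consisid_available` is hypothetical and not part of diffusers or ConsisID.

```python
def consisid_available() -> bool:
    """Return True if the installed diffusers package exposes ConsisIDPipeline.

    Hypothetical helper for illustration; it only inspects the package and
    does not download or load any model weights.
    """
    try:
        import diffusers  # raises ImportError if diffusers is not installed

        return getattr(diffusers, "ConsisIDPipeline", None) is not None
    except Exception:
        # Treat a missing or broken installation as "not available".
        return False


if __name__ == "__main__":
    if consisid_available():
        print("diffusers exposes ConsisIDPipeline")
    else:
        print("ConsisIDPipeline not found; try installing diffusers from source")
```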

Run the code

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
    prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
    face_helper_1,
    face_clip_model,
    face_helper_2,
    eva_transform_mean,
    eva_transform_std,
    face_main_model,
    "cuda",
    torch.bfloat16,
    image,
    is_align_face=True,
)

video = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=False,
    id_vit_hidden=id_vit_hidden,
    id_cond=id_cond,
    kps_cond=face_kps,
    generator=torch.Generator("cuda").manual_seed(42),
)
export_to_video(video.frames[0], "output.mp4", fps=8)
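The export call above writes the clip at 8 FPS. As a quick arithmetic check, and assuming the 49-frame output that this card mentions in its memory section, the clip length works out to slightly more than 6 seconds. The function name `clip_duration_seconds` is an illustrative assumption.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of a clip in seconds for a given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# 49 frames played back at 8 frames per second
print(clip_duration_seconds(49, 8))  # 6.125
```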

🛠️ Prompt Refiner

ConsisID is sensitive to prompt quality. You can use GPT-4o to refine the input text prompt. An example follows (original prompt: "a man is playing guitar."):

a man is playing guitar.

Change the sentence above to something like this (add some facial changes, even if they are minor. Don't make the sentence too long): 

The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind him. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.

Some sample prompts are available here.
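The refinement request shown above has three parts: the original short prompt, an instruction, and a style example. A minimal sketch of assembling that text in Python could look like the following. The helper name `build_refiner_request` is hypothetical, and sending the resulting string to GPT-4o or any other model is deliberately left out, since the client and its API are not specified here.

```python
# Instruction wording taken from the example request above.
INSTRUCTION = (
    "Change the sentence above to something like this "
    "(add some facial changes, even if they are minor. "
    "Don't make the sentence too long):"
)


def build_refiner_request(original_prompt: str, style_example: str) -> str:
    """Assemble the text to send to a language model for prompt refinement.

    Hypothetical helper: it only formats text and performs no API calls.
    """
    original = original_prompt.strip()
    example = style_example.strip()
    if not original:
        raise ValueError("original_prompt must not be empty")
    # Layout: short prompt, blank line, instruction, blank line, style example.
    return f"{original}\n\n{INSTRUCTION}\n\n{example}"


if __name__ == "__main__":
    request = build_refiner_request(
        "a man is playing guitar.",
        "The video features a man standing next to an airplane, "
        "engaged in a conversation on his cell phone.",
    )
    print(request)
```

The returned string can then be passed as the user message to whichever chat model is used for refinement.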

💡 GPU Memory Optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (about 6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or on the free-tier T4 in Colab. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to this script.

Feature (each row adds to the previous)	Max Memory Allocated	Max Memory Reserved
None (baseline)	37 GB	44 GB
enable_model_cpu_offload	22 GB	25 GB
enable_sequential_cpu_offload	16 GB	22 GB
vae.enable_slicing	16 GB	22 GB
vae.enable_tiling	5 GB	7 GB

# Enable these if you do not have multiple GPUs or a GPU with enough memory (such as an H100).
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

Warning: these optimizations increase inference time and may also reduce output quality.
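Because each optimization slows inference, it can help to enable only as many as the available memory requires. The sketch below picks the shortest prefix of the cumulative optimizations whose "Max Memory Reserved" figure from the table above fits a given budget. The helper name `optimizations_for` is hypothetical, the thresholds are simply the table's numbers, and actual usage will vary with hardware and library versions.

```python
from typing import List, Optional, Tuple

# Max reserved memory (GB) with no optimization, from the table above.
BASELINE_RESERVED_GB = 44

# (optimization, max reserved GB once applied on top of the earlier rows),
# copied from the table above in the same cumulative order.
CUMULATIVE_STEPS: List[Tuple[str, int]] = [
    ("enable_model_cpu_offload", 25),
    ("enable_sequential_cpu_offload", 22),
    ("vae.enable_slicing", 22),
    ("vae.enable_tiling", 7),
]


def optimizations_for(available_gb: float) -> Optional[List[str]]:
    """Return the optimizations to enable for a memory budget, in table order.

    Returns an empty list if the baseline fits, and None if even the full
    set of optimizations exceeds the budget according to the table.
    """
    if available_gb >= BASELINE_RESERVED_GB:
        return []
    applied: List[str] = []
    for name, reserved_gb in CUMULATIVE_STEPS:
        applied.append(name)
        if available_gb >= reserved_gb:
            return applied
    return None


if __name__ == "__main__":
    for budget in (80, 24, 16, 4):
        print(budget, optimizations_for(budget))
```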

🙌 Description

Repository: Code, Page, Data
Paper: https://huggingface.co/papers/2411.17440
Point of Contact: Shenghai Yuan

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star and citation.

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}

