CogVideoX-5b: open-source video generation model, free to use, for high-quality video

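Developed by THUDM

CogVideoX is the open-source version of the video generation model originating from QingYing, providing high-quality video generation.

Task: text-to-video · Language: English · License: other · Tags: #text-to-video-generation #high-quality-video-synthesis #efficient-multi-GPU-inference

Downloads: 92.32k

Released: 8/17/2024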

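Model overview

CogVideoX-5B is the larger of the two CogVideoX video generation models. It generates high-quality video from text descriptions, with better visual quality than the smaller 2B version.

Model features

High-quality video generation

Generates high-quality video from text prompts, with strong visual results.

Multiple precisions

Supports inference in BF16, FP16, FP32, FP8, and INT8.

Efficient inference

Completes a 50-step video generation in about 90 seconds on a single H100 GPU.

Multi-GPU support

Supports multi-GPU inference, with about 15 GB of VRAM needed when using diffusers in BF16.

Model capabilities

Text-to-video generation

Video content creation

Creative visual expression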

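Use cases

Creative content production

Short-video creation

Automatically generates creative short videos from text descriptions.

Produces 6-second videos at 720x480 resolution.

Educational content

Generates visual content for educational materials.

Entertainment industry

Concept video previews

Quickly generates concept videos for films or games.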

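🚀 CogVideoX-5B

CogVideoX-5B is an open-source video generation model originating from QingYing. It generates high-quality video from text descriptions, with strong visual results.

🚀 Quick start

This model can be deployed with the Hugging Face diffusers library by following the steps below.

We recommend visiting our GitHub for the prompt optimization and conversion methods, which give better results.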

Install the required dependencies

# diffusers>=0.30.1
# transformers>=4.44.2
# accelerate>=0.33.0 (installing from source is recommended)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
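
After installing, it can be useful to confirm that the environment meets the minimum versions listed in the comments above. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and the version parser is deliberately simple (it compares only the leading numeric components, so a pre-release such as `0.30.1rc1` is treated like `0.30.1`).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the comments in the install step above.
MINIMUMS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.2",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text):
    """Turn '0.30.1' or '0.31.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings by their leading numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, ok)} without importing the packages."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "upgrade needed"
        print(f"{name}: {installed or 'not installed'} -> {status}")
```

Running the script prints one line per package; any line ending in "upgrade needed" points at a package to reinstall with the pip command above.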

Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

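✨ Key features

Model introduction

CogVideoX is the open-source version of the video generation model originating from QingYing. The table below lists the video generation models currently available and their basic information.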

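Model name	CogVideoX-2B	CogVideoX-5B (this repository)
Model description	Entry-level model that balances compatibility; low cost to run and to build on.	Larger model with higher video generation quality and better visual effects.
Inference precision	FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported	BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported
Single-GPU VRAM usage	SAT FP16: 18 GB; diffusers FP16: from 4 GB; diffusers INT8 (torchao): from 3.6 GB	SAT BF16: 26 GB; diffusers BF16: from 5 GB; diffusers INT8 (torchao): from 4.4 GB
Multi-GPU inference VRAM usage	FP16 with diffusers: 10 GB*	BF16 with diffusers: 15 GB*
Inference speed (steps = 50, FP/BF16)	Single A100: ~90 s; single H100: ~45 s	Single A100: ~180 s; single H100: ~90 s
Fine-tuning precision	FP16	BF16
Fine-tuning VRAM usage (per GPU)	47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT)	63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT)
Prompt language	English*	English*
Prompt length limit	226 tokens	226 tokens
Video length	6 seconds	6 seconds
Frame rate	8 frames per second	8 frames per second
Video resolution	720 x 480; other resolutions are not supported (including for fine-tuning)	720 x 480; other resolutions are not supported (including for fine-tuning)
Positional encoding	3d_sincos_pos_embed	3d_rope_pos_embed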
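
The fixed values in the table (6 seconds, 8 frames per second, 226 prompt tokens) line up with the `num_frames=49` and `fps=8` arguments used in the example script. The sketch below works through that arithmetic. The helper names are hypothetical, and the "seconds x fps + 1" convention is an assumption about how the frame count is derived, not something the table itself states.

```python
FPS = 8                   # frame rate from the table
CLIP_SECONDS = 6          # video length from the table
MAX_PROMPT_TOKENS = 226   # prompt length limit from the table


def num_frames_for(seconds=CLIP_SECONDS, fps=FPS):
    """Frames to request: seconds * fps, plus one initial frame (assumed convention)."""
    return seconds * fps + 1


def playback_seconds(num_frames, fps=FPS):
    """Duration of the exported file when num_frames are written at a fixed fps."""
    return num_frames / fps


def fits_prompt_limit(token_ids, limit=MAX_PROMPT_TOKENS):
    """True if an already-tokenized prompt is within the token limit."""
    return len(token_ids) <= limit


frames = num_frames_for()
print(frames)                    # 49
print(playback_seconds(frames))  # 6.125
```

To check a real prompt, one option is to pass the token ids produced by the pipeline's own tokenizer, for example `fits_prompt_limit(pipe.tokenizer(prompt).input_ids)`; how the pipeline treats prompts over the limit is not covered here.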

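Data notes

Testing with the diffusers library was done with all of the optimizations that diffusers provides enabled. Actual VRAM and memory usage has not been tested on devices other than the NVIDIA A100 and H100. In general, this setup should work on all devices with the NVIDIA Ampere architecture or later. Disabling the optimizations increases VRAM usage significantly, with peak usage about 3 times the figures in the table, but speed improves by about 3-4 times. You can selectively disable some of the optimizations, including: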

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

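Multi-GPU inference requires disabling the enable_model_cpu_offload() optimization.
Using the INT8 model slows down inference. The trade-off lets GPUs with less VRAM run inference with minimal loss in video quality, at the cost of a significant drop in speed.
The 2B model was trained in FP16 and the 5B model in BF16. We recommend running inference in the precision the model was trained in.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs with less VRAM. Note that TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision must be used on NVIDIA H100 or later devices and requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations described above. Without VRAM optimization, inference is about 10% faster. Only the diffusers version of the model supports quantization.
The model supports English input only; prompts in other languages can be translated into English by a large language model during prompt refinement.

Notes

Use SAT for inference and fine-tuning of the SAT version of the model. See our GitHub for more information.

💻 Usage examples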
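
The four optimization calls listed above can be toggled from one place. The following is a minimal sketch, not part of the model's own tooling: `apply_memory_optimizations` and its keyword arguments are hypothetical names, and it calls only the four pipeline methods already shown. The notes state only that `enable_model_cpu_offload()` must be disabled for multi-GPU inference; skipping sequential offload in that case as well is an assumption.

```python
def apply_memory_optimizations(pipe, *, multi_gpu=False, lowest_vram=False,
                               vae_slicing=True, vae_tiling=True):
    """Enable a subset of the memory optimizations and return their names."""
    enabled = []
    if not multi_gpu:
        # Only one of the two offload modes is enabled at a time.
        if lowest_vram:
            # Sequential offload trades more speed for lower VRAM use.
            pipe.enable_sequential_cpu_offload()
            enabled.append("sequential_cpu_offload")
        else:
            pipe.enable_model_cpu_offload()
            enabled.append("model_cpu_offload")
    # Assumption: for multi-GPU runs both offload modes are skipped;
    # the notes above only mention enable_model_cpu_offload().
    if vae_slicing:
        pipe.vae.enable_slicing()
        enabled.append("vae_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        enabled.append("vae_tiling")
    return enabled
```

A typical call would be `apply_memory_optimizations(pipe)` right after `CogVideoXPipeline.from_pretrained(...)`; the returned list makes it easy to log which optimizations were active when comparing VRAM use and speed.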

Basic usage

The basic usage script is the same as the one in the quick start section above.

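Advanced usage

# Quantized inference with PytorchAO and Optimum-quanto (this script uses torchao)
# First, install PytorchAO and PyTorch Nightly from the GitHub source.
# Installing from source and from nightly builds is only needed until the next release.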

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

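In addition, when using PytorchAO, models can be serialized and stored in a quantized data type to save disk space. Examples and benchmarks can be found at the following links:

📚 Detailed documentation

Visit our GitHub, where you can find:

More detailed technical details and code explanations.
Prompt optimization and conversion methods.
Inference and fine-tuning for the SAT version of the model, including pre-release content.
Project update logs and more opportunities to interact.
The CogVideoX toolchain, to help you make better use of the model.
Inference code support for the INT8 model.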

📄 License

This model is released under the CogVideoX License.

📜 Citation

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}

🔍 Demo showcase

📍 Visit QingYing and the API platform to try the commercial video generation models.

Video Gallery with Captions

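In a garden, colorful butterflies flutter among the flowers, their delicate wings casting shadows on the petals. In the distance, a grand fountain cascades gently, its rhythmic sound creating a soothing atmosphere. In the cool shade of a large tree, a solitary wooden chair stands quietly, inviting solitude and reflection, its smooth surface worn by the touch of countless visitors seeking tranquility.

A small boy, head lowered and face full of determination, sprints through a torrential downpour as lightning flashes and thunder rumbles in the distance. The relentless rain pounds the ground, the splashing drops like an angry dance of the sky. Far away, the silhouette of a cozy house looms, a faint beacon calling out safety and warmth amid the harsh weather. The scene shows a child's unyielding spirit in the face of adversity.

Under the pink sky of Mars, an astronaut in a spacesuit reaches out to shake hands with an alien whose skin shimmers blue, Martian red dust clinging to the astronaut's boots. In the distance, a silver rocket towers, a symbol of human ingenuity, its engines shut down. The two representatives of different worlds exchange a historic greeting in the desolate yet beautiful Martian landscape.

An elderly man with a serene expression sits at the water's edge, a steaming cup of tea beside him. Brush in hand, he is absorbed in an oil painting on a canvas propped against a small, weathered table. The sea breeze ruffles his silver hair and billows his loose white shirt, and the salty air adds a distinctive touch to the masterpiece in progress. The glow of the setting sun spills across the calm sea, the canvas captures the vibrant colors, and the whole scene is filled with tranquility and inspiration.

In a dimly lit bar, purple light falls on the face of a mature man who blinks thoughtfully. The close-up focuses on his contemplative expression, with the background artfully blurred to create an air of mystery.

A golden retriever wearing stylish black sunglasses, its long fur flowing in the breeze, runs joyfully across a rooftop terrace freshly washed by light rain. Seen from a distance, its energetic bounds bring it closer and closer, tail wagging with excitement, while water droplets glisten on the concrete behind it. The overcast sky provides a dramatic backdrop that sets off the dog's vibrant golden coat.

On a bright sunny day, a row of willow trees lines the lakeshore, their slender branches swaying gently in the breeze. The calm lake reflects the clear blue sky as several elegant swans glide through the water, leaving delicate ripples that break its mirror-like stillness. The scene is serene and beautiful, with the willows' green foliage forming a picturesque frame for the peaceful bird visitors.

A Chinese mother in a soft pastel-colored robe gently rocks in a comfortable rocking chair in a cozy nursery. In the dimly lit bedroom, cute wind chimes hang from the ceiling, casting shadows that dance on the walls. Her baby, swaddled in a delicately patterned blanket, rests against her chest, earlier cries replaced by contented coos as the mother's gentle voice lulls the baby to sleep. The scent of lavender fills the air, adding to the calm, and the warm orange glow of a nearby night light bathes the scene in soft color, capturing a tender moment of motherly love.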