CogVideoX-2b open-source video generation model: the entry-level choice, with good compatibility and low running and development costs

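CogVideoX-2b

Developed by THUDM

CogVideoX is an open-source video generation model originating from QingYing. The 2B version is the entry-level model: it balances compatibility and is inexpensive to run and to build on.

Text-to-Video | English | License: Apache-2.0 | #text-to-video #low-VRAM-optimization #English-only

Downloads: 40.55k

Released: 8/5/2024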

Model Overview

CogVideoX is a text-to-video model that generates a 6-second video from a text description.

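Model Features

Low resource requirements

Designed as an entry-level model, suitable for running on devices with limited resources

High-quality video generation

Generates 6-second videos at 720x480 resolution and 8 fps

Multiple precisions supported

Supports FP16, BF16, FP32, FP8, and INT8 inference precision

Optimized inference

The diffusers library provides several optimization options that reduce VRAM requirements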

Model Capabilities

Text-to-video generation

Video content creation

Creative content generation

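Use Cases

Creative content production

Short video creation

Automatically generate short video content from a text description

Produces a 6-second creative video

Advertising content generation

Quickly generate product showcase videos

Produces an advertising video at 720x480 resolution

Education

Teaching video generation

Automatically generate supporting videos from teaching material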

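🚀 CogVideoX-2B

CogVideoX-2B is a video generation model that produces video content from a text description. The CogVideoX family is available in several sizes to suit different needs, from low-cost operation to higher-quality video generation.

📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv

📍 Visit QingYing and the API Platform to try the commercial video generation models.

🚀 Quick Start

For a better experience, we recommend visiting our GitHub to review the accompanying prompt optimization and conversion guidance.

This model can be deployed with the Hugging Face diffusers library by following the steps below: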

1. Install the required dependencies

# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (installing from source is recommended)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
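
After installing, it can be useful to confirm that the environment meets the minimum versions listed in the comments above. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical, the version parsing is deliberately simplified (it reads only the leading integers of each component), and the transformers minimum is assumed to be 4.44.0 because transformers releases are numbered 4.x.

```python
import re
from importlib import metadata

# Minimum versions from the install step above.
# Assumption: transformers is taken as 4.44.0 (4.x release numbering).
MINIMUM_VERSIONS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.0",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def version_tuple(version: str) -> tuple:
    """Turn '0.30.1' or '0.31.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Compare two version strings by their numeric components."""
    return version_tuple(installed) >= version_tuple(required)


def check_environment(minimums=MINIMUM_VERSIONS) -> dict:
    """Return {package: (installed_version_or_None, is_ok)}."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, required))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "missing or too old"
        print(f"{name}: {installed} ({status})")
```

Pre-release suffixes are ignored by this comparison, so a source install such as `0.31.0.dev0` is treated as `0.31.0`.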

2. Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

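# Memory optimizations: each call lowers VRAM use at the cost of speed
# and can be dropped individually if enough VRAM is available.
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(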
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
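
The values `num_frames=49` and `fps=8` in the code above line up with the stated 6-second clip length. The short sketch below works through that arithmetic; the function names are hypothetical, and reading 49 as "6 seconds times 8 fps, plus one initial frame" is an assumption made for illustration, not something the source states.

```python
def frames_for_duration(seconds: int, fps: int = 8) -> int:
    """Frame count for a clip, assuming one initial frame plus seconds * fps."""
    return seconds * fps + 1


def clip_duration(num_frames: int, fps: int = 8) -> float:
    """Playback length in seconds when num_frames are shown at fps."""
    return num_frames / fps


# 6 seconds at 8 fps, plus the assumed initial frame.
print(frames_for_duration(6, fps=8))
# Playback length of 49 frames exported at 8 fps.
print(clip_duration(49, fps=8))
```

Under this reading, a 49-frame clip exported at 8 fps plays for slightly more than 6 seconds.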

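✨ Key Features

Model Introduction

CogVideoX is the open-source version of the video generation model originating from QingYing. The table below lists the video generation models we currently offer, along with their basic information: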

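Model Name	CogVideoX-2B (this repository)	CogVideoX-5B
Model Description	Entry-level model that balances compatibility; low cost to run and to build on.	Larger model with higher video generation quality and better visual effects.
Inference Precision	FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported	BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported
Single-GPU VRAM Consumption	SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB*	SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB*
Multi-GPU Inference VRAM Consumption	FP16: 10 GB* using diffusers	BF16: 15 GB* using diffusers
Inference Speed (Step = 50, FP/BF16)	Single A100: ~90 s; single H100: ~45 s	Single A100: ~180 s; single H100: ~90 s
Fine-tuning Precision	FP16	BF16
Fine-tuning VRAM Consumption (per GPU)	47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT)	63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT)
Prompt Language	English* (both models)
Prompt Length Limit	226 tokens (both models)
Video Length	6 seconds (both models)
Frame Rate	8 frames per second (both models)
Video Resolution	720 x 480; no other resolutions are supported, including for fine-tuning (both models)
Positional Encoding	3d_sincos_pos_embed	3d_rope_pos_embed

* See the notes that follow this table.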
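
The 226-token prompt limit in the table can be checked before starting a generation run. The sketch below is an illustration only: `fit_prompt` and `whitespace_tokenize` are hypothetical helpers, and the whitespace tokenizer is a stand-in so the example runs without downloading anything. Real token counts come from the model's own tokenizer; the commented lines show one way that might be wired in, assuming the repository exposes a `tokenizer` subfolder.

```python
from typing import Callable, List, Tuple

MAX_PROMPT_TOKENS = 226  # prompt length limit from the table


def fit_prompt(
    prompt: str,
    tokenize: Callable[[str], List],
    limit: int = MAX_PROMPT_TOKENS,
) -> Tuple[int, bool]:
    """Return (token_count, fits_within_limit) for a prompt."""
    count = len(tokenize(prompt))
    return count, count <= limit


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for the demo; NOT the model's tokenizer."""
    return text.split()


count, fits = fit_prompt("A panda plays a miniature guitar", whitespace_tokenize)
print(count, fits)

# With the real tokenizer (assumption: a "tokenizer" subfolder exists in the repo):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("THUDM/CogVideoX-2b", subfolder="tokenizer")
# count, fits = fit_prompt(prompt, lambda s: tok(s).input_ids)
```

Subword tokenizers usually produce more tokens than a whitespace split, so the stand-in underestimates the real count.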

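Data Notes

When testing with the diffusers library, all optimizations provided by diffusers were enabled. Actual VRAM and memory usage has not been tested on devices other than the NVIDIA A100 / H100. In general, this setup should work on any device with the NVIDIA Ampere architecture or newer. With the optimizations disabled, VRAM usage rises sharply, with peak usage about 3 times the values shown in the table, but speed improves by about 3-4 times. The optimizations can be disabled selectively; they are: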

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
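
One way to make the four calls above individually switchable is a small wrapper. This is a sketch under stated assumptions, not part of the diffusers API: `apply_memory_optimizations` is a hypothetical helper, and it treats model-level and sequential CPU offload as alternative strategies, which is an interpretation rather than something the source states.

```python
def apply_memory_optimizations(
    pipe,
    offload="sequential",
    vae_slicing=True,
    vae_tiling=True,
):
    """Enable a chosen subset of the memory optimizations on a pipeline.

    offload: "model", "sequential", or None (no CPU offload at all).
    """
    if offload == "model":
        pipe.enable_model_cpu_offload()
    elif offload == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif offload is not None:
        raise ValueError(f"unknown offload mode: {offload!r}")
    # With offload=None the caller is responsible for placing the pipeline
    # on the GPU, for example with pipe.to("cuda").

    if vae_slicing:
        pipe.vae.enable_slicing()
    if vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


# Example (requires a loaded CogVideoXPipeline named `pipe`):
# apply_memory_optimizations(pipe, offload="model", vae_tiling=False)
```

Dropping optimizations trades higher VRAM use for faster inference, so the right subset depends on the GPU at hand.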

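For multi-GPU inference, the enable_model_cpu_offload() optimization must be disabled.
Using the INT8 model slows inference. This trade-off lets GPUs with less VRAM run inference with minimal loss of video quality, at the cost of a significant drop in speed.
The 2B model was trained in FP16 and the 5B model in BF16. We recommend running inference in the precision the model was trained in.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs with less VRAM. TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision requires an NVIDIA H100 or newer device, and also requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations described above. Without them, inference speed increases by about 10%. Only the diffusers version of the model supports quantization.
The model supports English input only; prompts in other languages can be translated into English by a large language model during the prompt refinement step.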
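
The VRAM figures in the table and the "about 3 times" note can be combined into a rough planning aid. The sketch below is a rule-of-thumb estimate only: the helper names are hypothetical, the figures are the "from" values quoted for CogVideoX-2B with diffusers, and applying the 3x factor uniformly to every configuration is an assumption rather than a measured result.

```python
# "From" VRAM figures (GB) for CogVideoX-2B with all diffusers
# optimizations enabled, as quoted in the table.
OPTIMIZED_VRAM_GB = {
    "diffusers INT8 (torchao)": 3.6,
    "diffusers FP16": 4.0,
}

# The notes say peak usage is "about 3 times" the table value when the
# optimizations are disabled. Applying it to every row is an assumption.
UNOPTIMIZED_PEAK_FACTOR = 3.0


def estimated_peak_vram_gb(config: str, optimized: bool = True) -> float:
    """Rough peak VRAM estimate for one configuration."""
    base = OPTIMIZED_VRAM_GB[config]
    return base if optimized else base * UNOPTIMIZED_PEAK_FACTOR


def configs_that_fit(available_gb: float, optimized: bool = True) -> list:
    """Configurations whose estimated peak fits in the available VRAM."""
    return [
        name
        for name in OPTIMIZED_VRAM_GB
        if estimated_peak_vram_gb(name, optimized) <= available_gb
    ]


print(configs_that_fit(16.0))                   # optimizations enabled
print(configs_that_fit(11.0, optimized=False))  # optimizations disabled
```

Because the table values are lower bounds ("from"), real usage may be higher; the estimate is best treated as a first filter rather than a guarantee.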

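Note

Use SAT for inference and fine-tuning of the SAT version of the model. Visit our GitHub for more information.

💻 Usage Examples

Quantized Inference

PytorchAO and Optimum-quanto can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs with less VRAM. TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference.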

# To get started, install PytorchAO from the GitHub source, along with PyTorch Nightly.
# Source and nightly installation are only required until the next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

# Weight-only INT8 quantization; int8_dynamic_activation_int8_weight
# (imported above) is the alternative scheme.
quantization = int8_weight_only

# Load each component from the 2B repository and quantize it in place.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
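
The example above uses INT8, which runs on a wide range of GPUs, whereas the data notes state that FP8 requires an NVIDIA H100 or newer. The sketch below shows one way to gate that choice on CUDA compute capability. The helper names are hypothetical, and mapping "H100 or newer" to compute capability 9.0 and above is an assumption made for this illustration.

```python
# Compute capabilities of GPUs named in this document.
KNOWN_CAPABILITIES = {
    "T4": (7, 5),
    "A100": (8, 0),
    "H100": (9, 0),
}


def supports_fp8(compute_capability) -> bool:
    """True for compute capability 9.0 (H100) or newer.

    Assumption: "H100 or newer" is read as capability >= (9, 0).
    """
    return tuple(compute_capability) >= (9, 0)


def pick_quantization(compute_capability, prefer_fp8: bool = False) -> str:
    """Return "fp8" only when requested and supported, otherwise "int8"."""
    if prefer_fp8 and supports_fp8(compute_capability):
        return "fp8"
    return "int8"


for name, capability in KNOWN_CAPABILITIES.items():
    print(name, pick_quantization(capability, prefer_fp8=True))

# On a real machine the capability can be read from PyTorch:
# import torch
# if torch.cuda.is_available():
#     print(pick_quantization(torch.cuda.get_device_capability(), prefer_fp8=True))
```

The returned string is only a label; wiring it to an actual torchao quantization config is left to the caller.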

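In addition, when using PytorchAO, models can be serialized and stored in a quantized data type to save disk space. Examples and benchmarks can be found at the following links:

📚 Documentation

Visit our GitHub, where you can find:

More detailed technical details and code explanations.
Prompt optimization and conversion.
Inference and fine-tuning of the SAT version of the model, including pre-release content.
Project update logs and more opportunities to interact.
The CogVideoX toolchain, to help you make better use of the model.
INT8 model inference code support.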

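📄 License

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module) is released under the CogVideoX License.

📜 Citation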

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}

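🎥 Demo Showcase

Video Gallery with Captions

A detailed wooden toy ship with intricately carved masts and sails glides smoothly over a plush blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown and has tiny windows. The soft, textured carpet provides a perfect backdrop, resembling an ocean. Around the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical indoor setting.

The camera follows a white vintage SUV with a black roof rack as it speeds up a steep dirt road lined with pine trees on a steep mountain slope. Dust kicks up from its tires, and sunlight falls on the speeding SUV, casting a warm glow over the whole scene. The dirt road curves gently into the distance, with no other vehicles in sight. The trees on either side of the road are redwoods, dotted with patches of greenery. Seen from behind, the car follows the curve with ease, as if on a tough drive through rugged terrain. The dirt road itself is surrounded by steep hills and mountains, beneath a clear blue sky with wispy white clouds.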