CogVideoX-2b open-source video generation model: an entry-level choice with low running and development costs


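Developed by rttrsabc

CogVideoX is the open-source version of the video generation model behind QingYing (清影). The 2B version is the entry-level model: it balances compatibility with low costs for running and secondary development.

Text-to-video | English | License: Apache-2.0 | #text-to-video #high-resolution-generation #multi-frame-coherence

Downloads: 22

Released: 9/9/2024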

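Model Overview

CogVideoX is a text-to-video diffusion model that generates 6-second videos at 8 fps and 720x480 resolution from a text description.

Model Features

Low VRAM requirements

Supports several quantization methods and can run on a GPU with as little as 3.6 GB of VRAM

Multiple precisions

Supports FP16, BF16, FP32, FP8, and INT8 inference precision

Optimized inference

The diffusers library provides several VRAM optimization options to suit different hardware

Model Capabilities

Text-to-video generation

Video content creation

Creative content generation

Use Cases

Creative content creation

Animated short production

Generate creative animated shorts from text descriptions

Produces 6-second, 8 fps videos at 720x480 resolution

Advertising concept generation

Quickly generate concept videos for product showcases

Education

Instructional video generation

Generate supplementary videos from teaching material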

🚀 CogVideoX-2B

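CogVideoX-2B is a Transformer-based video generation model that produces high-quality video from a text description. It offers a range of inference precisions, VRAM footprints, and inference speeds, which makes it usable in many settings.

🚀 Quick Start

This model can be deployed with the Hugging Face diffusers library. Follow the steps below.

We recommend visiting our GitHub for prompt optimization and conversion, which gives a better experience.

Install the required dependencies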

# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (installing from source is recommended)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
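
After installing, it can help to confirm that the environment actually meets the minimum versions listed above. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are illustrative and not part of any of these packages, and the version parsing is deliberately simple (it compares only the leading numeric components).

```python
import re
from importlib import metadata

# Minimum versions taken from the requirement comments above.
MINIMUMS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.0",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_requirements(MINIMUMS).items():
        status = {True: "ok", False: "too old", None: "not installed"}[ok]
        print(f"{name}: {status}")
```

Packages installed from source usually report a development suffix (for example `0.34.0.dev0`); the sketch ignores the suffix and compares the numeric part only.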

Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

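# Memory optimizations (optional; disable selectively to trade VRAM for speed)
pipe.enable_model_cpu_offload()       # keep whole sub-models on the CPU until they are needed
pipe.enable_sequential_cpu_offload()  # offload at sub-module level: lowest VRAM, slowest
pipe.vae.enable_slicing()             # decode the batch one sample at a time
pipe.vae.enable_tiling()              # decode each frame in tiles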
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
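
The call above requests `num_frames=49` and exports at `fps=8`, which lines up with the 6-second, 8 fps output described for this model if one extra frame is counted for the first frame. A small sketch of that arithmetic follows. The `seconds * fps + 1` rule is an assumption inferred from these numbers, and the helper names are illustrative, not part of diffusers.

```python
def frames_for(seconds: int, fps: int = 8) -> int:
    """Frame count for a clip of `seconds` at `fps`, plus one initial frame (assumed rule)."""
    return seconds * fps + 1


def duration_seconds(num_frames: int, fps: int = 8) -> float:
    """Playback length of `num_frames` frames exported at `fps`."""
    return num_frames / fps


if __name__ == "__main__":
    print(frames_for(6))         # frame count to request for a 6-second clip
    print(duration_seconds(49))  # playback length of the exported file
```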

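✨ Key Features

Model Introduction

CogVideoX is the open-source version of the video generation model behind QingYing. The table below lists the video generation models we currently provide, with their basic information:

Model name	CogVideoX-2B (this repository)	CogVideoX-5B
Model description	Entry-level model that balances compatibility; low cost to run and to build on.	Larger model with higher video generation quality and better visual effects.
Inference precision	FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported	BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported
Single-GPU VRAM consumption	SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB*	SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB*
Multi-GPU inference VRAM consumption	FP16: 10 GB* with diffusers	BF16: 15 GB* with diffusers
Inference speed (Step = 50, FP/BF16)	Single A100: ~90 s; single H100: ~45 s	Single A100: ~180 s; single H100: ~90 s
Fine-tuning precision	FP16	BF16
Fine-tuning VRAM consumption (per GPU)	47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT)	63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT)
Prompt language	English*	English*
Prompt length limit	226 tokens	226 tokens
Video length	6 seconds	6 seconds
Frame rate	8 frames per second	8 frames per second
Video resolution	720 x 480; other resolutions are not supported (including for fine-tuning)	720 x 480; other resolutions are not supported (including for fine-tuning)
Positional encoding	3d_sincos_pos_embed	3d_rope_pos_embed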
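
To make the single-GPU VRAM row easier to apply, here is a minimal sketch that checks which CogVideoX-2B configurations fit within a given VRAM budget. The figures are copied from the table above. The diffusers entries are "from" values, so treat the result as a lower bound rather than a guarantee. The names `VRAM_GB_2B` and `configs_that_fit` are illustrative only.

```python
# Single-GPU VRAM figures (GB) for CogVideoX-2B, copied from the table above.
# The diffusers figures are minimums ("from N GB").
VRAM_GB_2B = {
    "SAT FP16": 18.0,
    "diffusers FP16": 4.0,
    "diffusers INT8 (torchao)": 3.6,
}


def configs_that_fit(available_gb, table=VRAM_GB_2B):
    """Return configuration names whose listed VRAM figure is within budget, smallest first."""
    fitting = [(need, name) for name, need in table.items() if need <= available_gb]
    return [name for need, name in sorted(fitting)]


if __name__ == "__main__":
    for budget in (3, 8, 24):
        print(budget, "GB:", configs_that_fit(budget))
```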

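Data Notes

When testing with the diffusers library, all optimizations provided by diffusers were enabled. Actual VRAM and memory usage has not been tested on devices other than the NVIDIA A100 / H100. In general, this approach should work on all devices with the NVIDIA Ampere architecture or newer. If the optimizations are disabled, VRAM usage increases significantly, with peak usage about 3 times the figures in the table, but speed improves by 3 to 4 times. You can selectively disable some of the optimizations, including: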

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

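For multi-GPU inference, the enable_model_cpu_offload() optimization must be disabled.
Using the INT8 model reduces inference speed. This trade-off lets GPUs with less VRAM run inference with minimal loss of video quality, at the cost of a significant drop in speed.
The 2B model was trained in FP16 and the 5B model in BF16. We recommend running inference in the precision the model was trained in.
PytorchAO and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs with less VRAM. Notably, TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision requires an NVIDIA H100 or newer device, and requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations described above. Without them, inference is about 10% faster. Only the diffusers version of the model supports quantization.
The model supports English input only; prompts in other languages can be translated into English by a large model during prompt refinement.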
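
The notes above say the four optimizations can be disabled selectively, and that `enable_model_cpu_offload()` must be off for multi-GPU inference. One way to keep those choices in one place is a small wrapper like the sketch below. The function `apply_memory_optimizations` and its flags are hypothetical helpers, not part of diffusers; it only calls the four pipeline methods already shown in this document.

```python
def apply_memory_optimizations(
    pipe,
    *,
    model_offload=True,
    sequential_offload=True,
    vae_slicing=True,
    vae_tiling=True,
    multi_gpu=False,
):
    """Enable the selected memory optimizations on `pipe`.

    Returns the names of the optimizations that were applied, in order.
    """
    applied = []
    # Per the note above, model CPU offload is skipped for multi-GPU inference.
    if model_offload and not multi_gpu:
        pipe.enable_model_cpu_offload()
        applied.append("model_cpu_offload")
    if sequential_offload:
        pipe.enable_sequential_cpu_offload()
        applied.append("sequential_cpu_offload")
    if vae_slicing:
        pipe.vae.enable_slicing()
        applied.append("vae_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        applied.append("vae_tiling")
    return applied
```

For example, `apply_memory_optimizations(pipe, sequential_offload=False)` would keep model-level offload and the VAE options while skipping the slowest setting. Whether that fits on a given GPU has to be measured, since the figures in the table were taken with everything enabled.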

Notes

Use SAT for inference and fine-tuning of the SAT version of the model. Visit our GitHub for more information.

💻 Usage Examples

Advanced Usage

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

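# Quantization scheme applied to every module below;
# int8_dynamic_activation_int8_weight can be substituted here.
quantization = int8_weight_only

# Load each component from the 2B checkpoint and quantize it in place.
# The table above recommends FP16 for the 2B model; this example keeps BF16 as originally written.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the pipeline and run inference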
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

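In addition, when using PytorchAO, the model can be serialized and stored in the quantized data type to save disk space. Examples and benchmarks can be found at the following links:

📚 Documentation

Visit our GitHub, where you will find:

More detailed technical details and code explanations.
Prompt optimization and conversion.
Inference and fine-tuning for the SAT version of the model, and even pre-release content.
The project changelog and more opportunities to interact.
The CogVideoX toolchain, to help you make better use of the model.
INT8 model inference code support.

📄 License

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module) is released under the CogVideoX License.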

📚 Citation

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}

Demo Showcase

📍 Visit QingYing and the API platform to try the commercial video generation models.

Video Gallery with Captions

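A detailed wooden toy ship with intricately carved masts and sails glides smoothly over a plush blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop that resembles an ocean. Various other toys and children's items surround the ship, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical indoor setting.

The camera follows a white vintage SUV with a black roof rack as it speeds up a steep dirt road on a steep mountain slope lined with pine trees. Dust kicks up from its tires, and sunlight falls on the speeding SUV, casting a warm glow over the whole scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered among them. Seen from behind, the car follows the curve with ease, as if it were on a tough drive through rugged terrain. The dirt road itself is surrounded by steep hills and mountains, under a clear blue sky with wispy clouds.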