CogVideoX-2b open-source video generation model: the entry-level choice, with low running and development costs

Cogvideox 2b

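Developed by rttrsabc

CogVideoX is the open-source version of the video generation model originating from QingYing (清影). The 2B version is the entry-level model: it balances compatibility and keeps the cost of running and secondary development low.

Text-to-video · English · License: Apache-2.0 #text-to-video #high-resolution-generation #multi-frame-coherence

Downloads: 22

Released: 9/9/2024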

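Model overview

CogVideoX is a text-to-video diffusion model that generates 6-second videos at 8 fps and 720x480 resolution from a text description.

Model features

Low VRAM requirements

Supports several quantization methods and can run on GPUs with as little as 3.6 GB of VRAM

Multi-precision support

Supports FP16, BF16, FP32, FP8, and INT8 inference precisions

Optimized inference

The diffusers library provides several VRAM optimizations to suit different hardware environments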

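Model capabilities

Text-to-video generation

Video content creation

Creative content generation

Use cases

Creative content creation

Animated short production

Generate creative animated shorts from text descriptions

Can generate 6-second, 8 fps videos at 720x480 resolution

Advertising concept generation

Quickly generate concepts for product showcase videos

Education

Teaching video generation

Generate supplementary videos from teaching content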

🚀 CogVideoX-2B

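CogVideoX-2B is a Transformer-based video generation model that produces high-quality video from text descriptions. It supports a range of inference precisions, each with different VRAM consumption and inference speed, which makes it suitable for a variety of scenarios.

🚀 Quick Start

This model can be deployed with the Hugging Face diffusers library by following the steps below.

We recommend visiting our GitHub to check out the prompt optimization and conversion tools for a better experience.

Install the required dependencies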

# diffusers>=0.30.1
# transformers>=4.44.0
# accelerate>=0.33.0 (installing from source is recommended)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg

Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# Memory optimizations: each call below is optional and trades speed for lower VRAM use.
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
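The call above requests `num_frames=49` and exports the result at `fps=8`. As a quick sanity check of how those two values relate to the advertised 6-second clip length, the arithmetic can be sketched as follows. The helper name `clip_duration_seconds` is made up for this illustration and is not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback duration of a clip with `num_frames` frames shown at `fps` frames per second."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# Values taken from the example above.
duration = clip_duration_seconds(num_frames=49, fps=8)
print(f"{duration:.3f} s")  # 6.125 s
```

So 49 frames at 8 fps play for slightly over 6 seconds, which matches the 6-second figure quoted for this model.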

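✨ Key Features

Model introduction

CogVideoX is the open-source version of the video generation model originating from QingYing. The table below lists the video generation models we currently offer, along with their basic information: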

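Model name	CogVideoX-2B (this repository)	CogVideoX-5B
Model description	Entry-level model that balances compatibility. Low cost for running and secondary development.	Larger model with higher video generation quality and better visual effects.
Inference precision	FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported	BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported
Single-GPU VRAM consumption	SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB*	SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB*
Multi-GPU inference VRAM consumption	FP16: 10 GB* using diffusers	BF16: 15 GB* using diffusers
Inference speed (Step = 50, FP/BF16)	Single A100: ~90 seconds; single H100: ~45 seconds	Single A100: ~180 seconds; single H100: ~90 seconds
Fine-tuning precision	FP16	BF16
Fine-tuning VRAM consumption (per GPU)	47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT)	63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT)
Prompt language	English*	English*
Prompt length limit	226 tokens	226 tokens
Video length	6 seconds	6 seconds
Frame rate	8 frames per second	8 frames per second
Video resolution	720 x 480, no other resolutions supported (including fine-tuning)	720 x 480, no other resolutions supported (including fine-tuning)
Positional encoding	3d_sincos_pos_embed	3d_rope_pos_embed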
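
To make the single-GPU memory row easier to act on, its figures can be encoded as data and filtered against the VRAM you actually have. This is only a sketch: `MIN_VRAM_GB` copies the numbers from the table, the names `MIN_VRAM_GB` and `feasible_setups` are hypothetical, and the "from N GB" entries are lower bounds measured with all diffusers optimizations enabled, so real usage may be higher.

```python
# Minimum single-GPU VRAM in GB, copied from the table above.
MIN_VRAM_GB = {
    "CogVideoX-2B": {
        "SAT FP16": 18.0,
        "diffusers FP16": 4.0,
        "diffusers INT8 (torchao)": 3.6,
    },
    "CogVideoX-5B": {
        "SAT BF16": 26.0,
        "diffusers BF16": 5.0,
        "diffusers INT8 (torchao)": 4.4,
    },
}


def feasible_setups(model: str, available_gb: float) -> list[str]:
    """Return the setups whose listed minimum VRAM fits within `available_gb`."""
    options = MIN_VRAM_GB[model]
    return [name for name, need in options.items() if need <= available_gb]


print(feasible_setups("CogVideoX-2B", 4.0))
# ['diffusers FP16', 'diffusers INT8 (torchao)']
print(feasible_setups("CogVideoX-5B", 4.0))
# []
```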

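Data notes

When testing with the diffusers library, all optimizations provided by the diffusers library were enabled. Actual VRAM/memory usage with this setup has not been tested on devices other than the NVIDIA A100 / H100. In general, it should work on all devices with the NVIDIA Ampere architecture or newer. If the optimizations are disabled, VRAM usage increases significantly, with peak VRAM roughly 3 times the value shown in the table; in return, inference runs about 3 to 4 times faster. You can selectively disable some of the optimizations, including: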

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
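
One way to make these optimizations selectable is a small wrapper that calls only the methods you ask for. The sketch below is an illustration, not part of diffusers: `apply_memory_optimizations` is a hypothetical helper, and `FakePipeline` / `FakeVAE` are stand-ins that only record calls so the example runs without a GPU or model download. With a real `CogVideoXPipeline`, the same four method names from the list above would be invoked.

```python
def apply_memory_optimizations(pipe, cpu_offload="sequential",
                               vae_slicing=True, vae_tiling=True):
    """Enable only the selected optimizations on a diffusers-style pipeline.

    cpu_offload: "sequential", "model", or None to keep everything on the GPU.
    """
    if cpu_offload == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif cpu_offload == "model":
        pipe.enable_model_cpu_offload()
    elif cpu_offload is not None:
        raise ValueError(f"unknown cpu_offload mode: {cpu_offload!r}")
    if vae_slicing:
        pipe.vae.enable_slicing()
    if vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


class FakeVAE:
    """Stand-in for the pipeline's VAE; records which methods were called."""
    def __init__(self, log):
        self.log = log

    def enable_slicing(self):
        self.log.append("vae.enable_slicing")

    def enable_tiling(self):
        self.log.append("vae.enable_tiling")


class FakePipeline:
    """Stand-in for a real pipeline, used only to demonstrate the helper."""
    def __init__(self):
        self.log = []
        self.vae = FakeVAE(self.log)

    def enable_model_cpu_offload(self):
        self.log.append("enable_model_cpu_offload")

    def enable_sequential_cpu_offload(self):
        self.log.append("enable_sequential_cpu_offload")


pipe = apply_memory_optimizations(FakePipeline(), cpu_offload=None, vae_tiling=False)
print(pipe.log)  # ['vae.enable_slicing']
```

Passing `cpu_offload=None` skips both offload calls, which may be useful when you have enough VRAM and prefer faster inference.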

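For multi-GPU inference, the enable_model_cpu_offload() optimization must be disabled.
Using the INT8 model reduces inference speed. This trade-off lets GPUs with less VRAM run inference while keeping the loss in video quality minimal, although inference becomes significantly slower.
The 2B model was trained in FP16 precision and the 5B model in BF16 precision. We recommend running inference in the same precision the model was trained in.
PytorchAO and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it possible to run the model on a free T4 Colab or on GPUs with little VRAM. Notably, TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision must be used on NVIDIA H100 or newer devices and requires installing the torch, torchao, diffusers, and accelerate Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations described above. Without VRAM optimization, inference is about 10% faster. Only the diffusers version of the model supports quantization.
The model only supports English input; prompts in other languages can be translated into English by a large model during prompt refinement.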
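
The precision recommendation in the notes above can be captured as a small lookup. This is a minimal sketch under stated assumptions: the mapping simply restates the training precisions given in the notes, and `TRAINING_PRECISION` and `recommended_dtype_name` are hypothetical names introduced for illustration.

```python
# Training precision per model, as stated in the notes above.
TRAINING_PRECISION = {
    "CogVideoX-2B": "float16",
    "CogVideoX-5B": "bfloat16",
}


def recommended_dtype_name(model: str) -> str:
    """Return the dtype name recommended for inference with `model`."""
    try:
        return TRAINING_PRECISION[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


print(recommended_dtype_name("CogVideoX-2B"))  # float16
print(recommended_dtype_name("CogVideoX-5B"))  # bfloat16
```

With PyTorch installed, `getattr(torch, recommended_dtype_name("CogVideoX-2B"))` resolves to `torch.float16`, which can then be passed as `torch_dtype` when loading the pipeline.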

Notes

Use SAT for inference and fine-tuning of the SAT version of the model. Visit our GitHub for more information.

💻 Usage Examples

Basic usage

The basic example is identical to the script shown in the quick-start section above.

Advanced usage

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

# Weight-only INT8 quantization; int8_dynamic_activation_int8_weight is an alternative.
quantization = int8_weight_only

# Load every component from the 2B checkpoint so it matches the pipeline created below.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

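In addition, when using PytorchAO, models can be serialized and stored in a quantized data type to save disk space. Find examples and benchmarks at the following links:

📚 Documentation

Visit our GitHub, where you will find:

More detailed technical details and code explanations.
Prompt optimization and conversion.
Inference and fine-tuning for the SAT version of the model, and even pre-release content.
Project update logs and more opportunities for interaction.
The CogVideoX toolchain, to help you make better use of the model.
INT8 model inference code support.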

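📄 License

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module) is released under the CogVideoX License.

📚 Citation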

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}

Demo Showcase

📍 Visit QingYing and the API platform to try the commercial video generation models.

Video Gallery with Captions

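A detailed wooden toy ship with intricately carved masts and sails glides smoothly across a plush blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown and has tiny windows. The soft, textured carpet provides a perfect backdrop, resembling an ocean. Various other toys and children's items surround the ship, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's voyage symbolizing endless adventures in a whimsical indoor setting.

The camera follows a white vintage SUV with a black roof rack as it speeds up a steep dirt road on a steep mountain slope lined with pine trees. Dust kicks up from its tires, and sunlight falls on the speeding SUV, casting a warm glow over the whole scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered among them. Seen from behind, the car follows the curve with ease, as if it were on a tough drive through rugged terrain. The dirt road itself is surrounded by steep hills and mountains, under a clear blue sky with wisps of white cloud.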