CogVideoX1.5-5B: an open-source video generation model with free high-resolution video generation


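CogVideoX1.5-5B

Developed by THUDM

CogVideoX is an open-source video generation model similar to QingYing that supports high-resolution video generation.

Text-to-video | English | License: other | #HD video generation #Multiple frame rates #Multi-GPU optimization

Downloads: 11.12k

Released: 11/2/2024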

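Model overview

CogVideoX is an advanced video generation model that produces high-quality video from text prompts. It supports high-resolution output (1360x768) and generates 5- or 10-second videos.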

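Model features

High-resolution video generation

Generates high-quality video at 1360x768 resolution

Flexible video length

Generates 5- or 10-second videos at 16 frames per second

Multiple precisions

Supports BF16, FP16, FP32, FP8*, and INT8 inference precision

Efficient inference

Memory optimizations in the diffusers library let the model run on GPUs with as little as 10 GB of VRAM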

Model capabilities

Text-to-video generation

High-resolution video generation

Multi-length video generation

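Use cases

Creative content generation

Short video creation

Quickly generates creative short videos from text prompts

Produces high-quality 5- or 10-second videos

Education

Teaching video generation

Automatically generates supplementary videos from teaching material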

🚀 CogVideoX1.5-5B

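CogVideoX1.5-5B is an open-source video generation model similar to QingYing. It generates high-quality video from text input.

📄 Read in Chinese | 🤗 Huggingface Space | 🌐 Github | 📜 arxiv

📍 Visit QingYing and the API platform to try the larger-scale commercial video generation models.

✨ Key Features

The table below lists basic information about the video generation model we currently provide.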

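Attribute	Details
Model type	Video generation model
Supported language	English
Inference precision	BF16 (recommended), FP16, FP32, FP8*, INT8; not supported: INT4
Single-GPU memory usage	Varies by model and precision; for CogVideoX1.5-5B with diffusers BF16, from 10 GB*
Multi-GPU memory usage	Varies by model and precision; for CogVideoX1.5-5B with diffusers BF16, 24 GB*
Inference speed	Varies by model and hardware; for CogVideoX1.5-5B, about 1000 seconds on a single A100 (5-second video)
Prompt language	English*
Prompt token limit	Varies by model; 224 tokens for CogVideoX1.5-5B
Video length	Varies by model; 5 or 10 seconds for CogVideoX1.5-5B
Frame rate	Varies by model; 16 frames per second for CogVideoX1.5-5B
Positional encoding	Varies by model; 3d_rope_pos_embed for CogVideoX1.5-5B
Download link (Diffusers)	Several platforms, including HuggingFace, ModelScope, and WiseModel
Download link (SAT)	Several platforms, including HuggingFace, ModelScope, and WiseModel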
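
The table lists two prompt constraints for CogVideoX1.5-5B: English input and a 224-token limit. A minimal sketch of a pre-flight check follows. The function names are hypothetical, the non-ASCII test is only a rough heuristic for "not English", and the whitespace tokenizer is a stand-in that will undercount compared with the model's real tokenizer.

```python
from typing import Callable, List

MAX_PROMPT_TOKENS = 224  # limit listed for CogVideoX1.5-5B in the table


def check_prompt(
    prompt: str,
    tokenize: Callable[[str], List],
    max_tokens: int = MAX_PROMPT_TOKENS,
) -> List[str]:
    """Return human-readable warnings; an empty list means no issue was found."""
    warnings = []
    # Heuristic only: non-ASCII characters suggest the prompt is not plain English.
    if not prompt.isascii():
        warnings.append("prompt contains non-ASCII characters; translate it to English first")
    n_tokens = len(tokenize(prompt))
    if n_tokens > max_tokens:
        warnings.append(f"prompt has {n_tokens} tokens, above the {max_tokens}-token limit")
    return warnings


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for illustration; NOT the tokenizer the model uses."""
    return text.split()


if __name__ == "__main__":
    print(check_prompt("A panda plays a guitar in a bamboo forest.", whitespace_tokenize))
    print(check_prompt("一只熊猫在竹林里弹吉他", whitespace_tokenize))
```

With the real tokenizer, the `tokenize` argument could be something like `lambda s: tokenizer(s).input_ids`, where `tokenizer` is loaded through transformers; the exact loading call depends on the checkpoint layout and is not shown in this card.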

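Data notes

When testing with the diffusers library, all optimizations included in the library were enabled. This setup has not been tested on devices other than the NVIDIA A100/H100, but it should generally work on any NVIDIA Ampere-architecture or newer device. Disabling the optimizations roughly triples VRAM usage but makes inference 3-4 times faster. You can selectively disable individual optimizations, including: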

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

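For multi-GPU inference, the enable_sequential_cpu_offload() optimization must be disabled.
INT8 models lower the VRAM requirement enough to fit smaller GPUs with minimal loss in video quality, at the cost of significantly slower inference.
PytorchAO and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements, so the model can run on GPUs with less VRAM. TorchAO quantization is fully compatible with torch.compile, which can significantly speed up inference. FP8 precision requires an NVIDIA H100 or newer device, along with source installs of torch, torchao, diffusers, and accelerate. CUDA 12.4 is recommended.
The inference speed tests also used the VRAM optimizations above; without them, speed increases by about 10%. Only the diffusers versions of the models support quantization.
The model supports English input only; prompts written in other languages should be translated into English with a larger model.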
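
The three optimization calls and the multi-GPU caveat in these notes can be combined in a small helper. This is a sketch rather than part of the diffusers API: `apply_memory_optimizations` and its `multi_gpu` flag are hypothetical names, and the function only calls the three methods already named in this section on whatever pipeline object it receives.

```python
def apply_memory_optimizations(pipe, multi_gpu: bool = False) -> list:
    """Enable the memory optimizations named in the notes; return the ones applied."""
    applied = []
    if not multi_gpu:
        # Sequential CPU offload has to stay disabled for multi-GPU inference.
        pipe.enable_sequential_cpu_offload()
        applied.append("sequential_cpu_offload")
    # VAE slicing and tiling reduce peak memory during decoding.
    pipe.vae.enable_slicing()
    applied.append("vae_slicing")
    pipe.vae.enable_tiling()
    applied.append("vae_tiling")
    return applied
```

Called as `apply_memory_optimizations(pipe)` on a single GPU, it enables all three; with `multi_gpu=True` it skips the CPU offload and keeps only the VAE options.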

Notes

Use SAT for inference and fine-tuning of the SAT versions of the models. See our GitHub for more details.

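🚀 Quick Start

This model can be deployed with the Hugging Face diffusers library. Follow the steps below to get started.

We recommend visiting our GitHub for prompt optimization and conversion to get a better experience.

📦 Installation

Install the required dependencies: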

# diffusers (from source)
# transformers>=4.46.2
# accelerate>=1.1.1
# imageio-ffmpeg>=0.5.1
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
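
After installing, it can help to confirm that the minimum versions listed in the comments above are met. The sketch below uses only the standard library; the helper names are hypothetical, and the version parser is deliberately simple (it compares leading dotted integers and ignores suffixes such as `.dev0`), so treat it as a rough check rather than a full version comparison.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the comments in the install snippet.
MINIMUMS = {
    "transformers": "4.46.2",
    "accelerate": "1.1.1",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text: str) -> tuple:
    """Keep only the leading dotted integers, e.g. '0.32.0.dev0' -> (0, 32, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def report(minimums: dict = MINIMUMS) -> dict:
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return status


if __name__ == "__main__":
    for name, state in report().items():
        print(f"{name}: {state}")
```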

💻 Usage Examples

Basic usage

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

# 81 frames at the model's 16 fps play back as a clip of about 5 seconds.
export_to_video(video, "output.mp4", fps=16)
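
The call above requests num_frames=81. At the 16 frames per second listed for CogVideoX1.5-5B, that matches a 5-second clip if one extra initial frame is counted (5 × 16 + 1 = 81). The sketch below encodes that reading. The "+1" rule and the resulting 161-frame value for a 10-second clip are inferred from the example, not stated in this card, and the function names are hypothetical.

```python
FPS = 16  # frame rate listed for CogVideoX1.5-5B


def frames_for_duration(seconds: int, fps: int = FPS) -> int:
    """Frame count for a clip: seconds * fps plus one initial frame (assumed rule)."""
    return seconds * fps + 1


def duration_for_frames(num_frames: int, fps: int = FPS) -> float:
    """Inverse of frames_for_duration: playback length in seconds."""
    return (num_frames - 1) / fps


if __name__ == "__main__":
    for seconds in (5, 10):
        print(f"{seconds} s -> num_frames={frames_for_duration(seconds)}")
```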

Advanced usage

# Quantized inference with PytorchAO and Optimum-quanto.
# Before starting, install PytorchAO from the GitHub source and PyTorch Nightly.
# The source and nightly installs are only needed until the next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

quantization = int8_weight_only

# Load each module in BF16, then quantize its weights to INT8 in place.
text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="text_encoder",
                                              torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="transformer",
                                                          torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the text-to-video pipeline from the quantized modules and run inference.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

# 81 frames at the model's 16 fps play back as a clip of about 5 seconds.
export_to_video(video, "output.mp4", fps=16)
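
To get a feel for why lower-precision weights reduce memory needs, the arithmetic below estimates raw weight storage per precision. It is a back-of-the-envelope sketch: the 5-billion parameter count is an assumption read off the "5B" in the model name, and the estimate ignores activations, quantization scale factors, layers left unquantized, and the separate text encoder and VAE, so real VRAM usage will differ.

```python
# Bytes needed to store one parameter at each precision named in this card.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "FP8": 1, "INT8": 1}


def weight_gigabytes(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10**9 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 5e9  # assumption: roughly 5 billion parameters ("5B" in the model name)
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_gigabytes(params, precision):.0f} GB")
```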

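In addition, the models can be serialized and stored in a quantized data type with PytorchAO to save disk space. Examples and benchmarks are available at the following links: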

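📚 Documentation

Feel free to visit our GitHub, where you will find:

More detailed technical explanations and code.
Optimized prompt examples and conversion.
Detailed code for model inference and fine-tuning.
The project changelog and more opportunities to get involved.
The CogVideoX toolchain, to help you make better use of the model.
INT8 model inference code.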

📄 License

This model is released under the CogVideoX LICENSE.

Citation

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}