🚀 CogView4-6B
CogView4-6B is a text-to-image model that generates images from input text. It achieves strong results on multiple benchmarks, with high image-generation quality and prompt accuracy.
🤗 Model Space | 🌐 GitHub Repository | 📜 CogView3 Paper

🚀 Quick Start
First, make sure you install the `diffusers` library from source:

```shell
pip install git+https://github.com/huggingface/diffusers.git
```

or clone the repository and install it in editable mode:

```shell
git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install -e .
```

Then, run the following code:
```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```
✨ Key Features
Inference Requirements and Model Overview
- Resolution: width and height must both be between 512px and 2048px and divisible by 32, and the total pixel count must not exceed 2^21 pixels.
- Precision: BF16 / FP32 are supported (FP16 is not supported, as it causes overflow and produces all-black images).
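The resolution constraints above can be checked before calling the pipeline. A minimal sketch (the helper name `check_resolution` is ours, not part of the diffusers API):

```python
def check_resolution(width: int, height: int) -> None:
    """Validate an output resolution against CogView4-6B's documented limits."""
    for name, value in (("width", width), ("height", height)):
        if not 512 <= value <= 2048:
            raise ValueError(f"{name}={value} must be between 512px and 2048px")
        if value % 32 != 0:
            raise ValueError(f"{name}={value} must be divisible by 32")
    if width * height > 2**21:  # 2^21 = 2,097,152 pixels
        raise ValueError(f"{width}x{height} exceeds the 2^21 total-pixel budget")

check_resolution(1024, 1024)  # OK: 1,048,576 pixels, well under the budget
```

Note that the upper bound of 2048px applies per side only when the other side is small enough: 2048 × 1024 sits exactly at the 2^21 pixel budget, while 2048 × 2048 exceeds it.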
Memory usage, tested with BF16 precision and `batchsize=4`, is shown in the table below:
| Resolution | `enable_model_cpu_offload` off | `enable_model_cpu_offload` on | `enable_model_cpu_offload` on + 4-bit text encoder |
|---|---|---|---|
| 512 × 512 | 33 GB | 20 GB | 13 GB |
| 1280 × 720 | 35 GB | 20 GB | 13 GB |
| 1024 × 1024 | 35 GB | 20 GB | 13 GB |
| 1920 × 1280 | 39 GB | 20 GB | 14 GB |
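The rightmost column corresponds to loading the text encoder in 4-bit. One possible way to set this up, sketched here as an untested configuration that assumes `bitsandbytes` is installed and that `AutoModel` resolves the model's text-encoder subfolder:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from diffusers import CogView4Pipeline

# Load only the text encoder in 4-bit (NF4) to cut memory,
# as in the last column of the table above.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
text_encoder = AutoModel.from_pretrained(
    "THUDM/CogView4-6B",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Reuse the quantized encoder in the pipeline; the rest of the model stays in BF16.
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
```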
Model Metrics
We evaluated the model on several benchmarks and obtained the following scores:
DPG-Bench
| Model | Overall | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|
| SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
| Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |
GenEval
| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attr. |
|---|---|---|---|---|---|---|---|
| SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
| SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |
T2I-CompBench
| Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial CLIP | Complex 3-in-1 |
|---|---|---|---|---|---|---|---|---|
| SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
| PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
| SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
| DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
| Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
| Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
| CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |
Chinese Text Accuracy Evaluation
| Model | Precision | Recall | F1 Score | Pick@4 |
|---|---|---|---|---|
| Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
| CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
📄 License
This model is released under the Apache 2.0 License.
📚 Citation
🌟 If you find our work helpful, please consider citing our paper and leaving a valuable Star.
```
@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```