# 🚀 CogView4-6B

CogView4-6B is a text-to-image model that generates images from input text. It achieves strong results on multiple benchmarks, with high image-generation quality and prompt accuracy.

🤗 Model Space | 🌐 GitHub Repository | 📜 CogView3 Paper

## 🚀 Quick Start

First, make sure you install the `diffusers` library from source:

```shell
pip install git+https://github.com/huggingface/diffusers.git
```

Alternatively, clone the repository and install it in editable mode:

```shell
git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install -e .
```

Then run the following code:
```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```
## ✨ Key Features

### Inference Requirements and Model Overview

- Resolution: width and height must each be between 512px and 2048px and divisible by 32, and the total pixel count must not exceed 2^21 pixels.
- Precision: BF16 / FP32 are supported (FP16 is not supported, as it causes overflow and produces all-black images).
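The resolution constraints above can be expressed as a small validation helper. This is just an illustrative sketch; `check_resolution` is a hypothetical function, not part of the diffusers API:

```python
def check_resolution(width: int, height: int) -> None:
    """Validate the stated CogView4-6B resolution constraints:
    each dimension in [512, 2048] and divisible by 32,
    total pixels at most 2**21."""
    for name, dim in (("width", width), ("height", height)):
        if not 512 <= dim <= 2048:
            raise ValueError(f"{name} must be between 512 and 2048 px, got {dim}")
        if dim % 32 != 0:
            raise ValueError(f"{name} must be divisible by 32, got {dim}")
    if width * height > 2**21:
        raise ValueError(f"total pixels {width * height} exceed 2**21 = {2**21}")

check_resolution(1024, 1024)  # OK: 2**20 pixels
check_resolution(2048, 1024)  # OK: exactly 2**21 pixels
```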
Tested with BF16 precision and batchsize=4, memory usage is shown in the table below:
| Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON, text encoder 4-bit |
|---|---|---|---|
| 512 × 512 | 33 GB | 20 GB | 13 GB |
| 1280 × 720 | 35 GB | 20 GB | 13 GB |
| 1024 × 1024 | 35 GB | 20 GB | 13 GB |
| 1920 × 1280 | 39 GB | 20 GB | 14 GB |
### Model Metrics

We evaluated the model on multiple benchmarks and obtained the following scores:

#### DPG-Bench

| Model | Overall | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|
| SDXL | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| PixArt-alpha | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SD3-Medium | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| DALL-E 3 | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| Flux.1-dev | 83.79 | 85.80 | 86.79 | 89.98 | 90.04 | 89.90 |
| Janus-Pro-7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| CogView4-6B | 85.13 | 83.85 | 90.35 | 91.17 | 91.14 | 87.29 |
#### GenEval

| Model | Overall | Single Object | Two Objects | Counting | Colors | Position | Color Attribution |
|---|---|---|---|---|---|---|---|
| SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| PixArt-alpha | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 |
| SD3-Medium | 0.74 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux.1-dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| CogView4-6B | 0.73 | 0.99 | 0.86 | 0.66 | 0.79 | 0.48 | 0.58 |
#### T2I-CompBench

| Model | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-spatial CLIP | Complex 3-in-1 |
|---|---|---|---|---|---|---|---|---|
| SDXL | 0.5879 | 0.4687 | 0.5299 | 0.2133 | 0.3566 | 0.4988 | 0.3119 | 0.3237 |
| PixArt-alpha | 0.6690 | 0.4927 | 0.6477 | 0.2064 | 0.3901 | 0.5058 | 0.3197 | 0.3433 |
| SD3-Medium | 0.8132 | 0.5885 | 0.7334 | 0.3200 | 0.4084 | 0.6174 | 0.3140 | 0.3771 |
| DALL-E 3 | 0.7785 | 0.6205 | 0.7036 | 0.2865 | 0.3744 | 0.5880 | 0.3003 | 0.3773 |
| Flux.1-dev | 0.7572 | 0.5066 | 0.6300 | 0.2700 | 0.3992 | 0.6165 | 0.3065 | 0.3628 |
| Janus-Pro-7B | 0.5145 | 0.3323 | 0.4069 | 0.1566 | 0.2753 | 0.4406 | 0.3137 | 0.3806 |
| CogView4-6B | 0.7786 | 0.5880 | 0.6983 | 0.3075 | 0.3708 | 0.6626 | 0.3056 | 0.3869 |
#### Chinese Text Accuracy Evaluation

| Model | Precision | Recall | F1 Score | Pick@4 |
|---|---|---|---|---|
| Kolors | 0.6094 | 0.1886 | 0.2880 | 0.1633 |
| CogView4-6B | 0.6969 | 0.5532 | 0.6168 | 0.3265 |
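The F1 column is the standard harmonic mean of precision and recall, which can be verified from the table's own numbers (plain Python, agreeing with the reported values up to rounding):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CogView4-6B row: precision 0.6969, recall 0.5532
print(round(f1_score(0.6969, 0.5532), 4))  # 0.6168
```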
## 📄 License

This model is released under the Apache 2.0 license.

## 📚 Citation

🌟 If you find our work helpful, please consider citing our paper and leaving a star.

```bibtex
@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```