🚀 CogView3-Plus-3B
CogView3-Plus-3B is a text-to-image generation model that supports generating images from 512 to 2048 px, with efficient inference and flexible resolution settings.
📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv
📍 Visit 清言 (Qingyan) and the API Platform to experience larger-scale commercial video generation models.
🚀 Quick Start
First, make sure the diffusers library is installed from source:

```shell
pip install git+https://github.com/huggingface/diffusers.git
```
Then run the following code:
```python
from diffusers import CogView3PlusPipeline
import torch

# Load in BF16: FP16 is not supported and produces black images due to overflow.
pipe = CogView3PlusPipeline.from_pretrained(
    "THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16
)

# Reduce VRAM usage. enable_model_cpu_offload() manages device placement
# itself, so do not also call .to("cuda").
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview3.png")
```
For more details, and to download the original SAT weights, visit our GitHub.
✨ Key Features
- Flexible resolution: width and height must be between 512 px and 2048 px, and divisible by 32.
- Fast inference: 1 s / step (tested on an A100)
- Selectable precision: BF16 / FP32 supported (FP16 is not supported, as it overflows and produces black images)
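The resolution constraints above can be sketched as a small helper. `snap_resolution` below is a hypothetical function (not part of diffusers) that clamps a requested size into the supported range and rounds it down to a multiple of 32:

```python
def snap_resolution(width: int, height: int) -> tuple[int, int]:
    """Snap a requested size to the model's constraints: 512..2048 px, divisible by 32."""
    def snap(v: int) -> int:
        v = max(512, min(2048, v))  # clamp to the supported range
        return (v // 32) * 32       # round down to a multiple of 32
    return snap(width), snap(height)

print(snap_resolution(1280, 725))  # -> (1280, 704)
```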
🔧 Technical Details
Inference requirements and model overview
This model is the DiT version of CogView3, a text-to-image generation model that supports generating images from 512 to 2048 px.
Memory Consumption
We measured memory consumption at several common resolutions on an A100 (batch size = 1, BF16), as shown in the table below:
| Resolution | enable_model_cpu_offload OFF | enable_model_cpu_offload ON |
|---|---|---|
| 512 × 512 | 19 GB | 11 GB |
| 720 × 480 | 20 GB | 11 GB |
| 1024 × 1024 | 23 GB | 11 GB |
| 1280 × 720 | 24 GB | 11 GB |
| 2048 × 2048 | 25 GB | 11 GB |
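To reproduce measurements like those in the table, peak allocated VRAM can be read back from PyTorch after a generation. `bytes_to_gb` is a small hypothetical helper; the commented usage assumes the `pipe` and `prompt` from the quick-start example:

```python
def bytes_to_gb(n_bytes: int) -> float:
    # Convert a byte count (as reported by torch.cuda) to GiB.
    return n_bytes / 1024**3

# Usage sketch (assumes `pipe` and `prompt` from the quick-start code above):
# import torch
# torch.cuda.reset_peak_memory_stats()
# _ = pipe(prompt=prompt, width=1024, height=1024, num_inference_steps=50).images[0]
# print(f"peak VRAM: {bytes_to_gb(torch.cuda.max_memory_allocated()):.1f} GiB")
```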
📄 Citation
🌟 If you find our work helpful, please cite our paper and leave a star:
```bibtex
@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```
📄 License
This model is released under the Apache 2.0 License.