CogView3-Plus-3B: Open-Source Text-to-Image Model for 512 to 2048 Pixel Images

Cogview3 Plus 3B

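Developed by THUDM

CogView3-Plus-3B is the DiT version of CogView3 and supports text-to-image generation at 512 to 2048 pixels.

Task: Text-to-Image | Language: English | License: Apache-2.0 | Tags: #high-resolution image generation #relay diffusion #Chinese prompt support

Downloads: 385

Released: 10/4/2024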

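Model Overview

CogView3-Plus-3B is a text-to-image model that supports high-resolution image generation and is suited to a range of creative and design scenarios.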

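Model Features

High-resolution image generation

Generates images from 512 to 2048 pixels per side, covering a wide range of applications.

Fast inference

Runs at about 1 second per step when tested on an A100.

Memory optimization

Supports CPU offloading and VAE slicing, which substantially lower GPU memory usage and make the model usable on a wider range of hardware.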

Model Capabilities

Text-to-image generation

High-resolution image generation

Creative design

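Use Cases

Creative design

Sports car design

Generate a high-resolution image of a cherry-red sports car that shows its streamlined body and design details.

Result: a high-quality image suitable for design presentations and creative inspiration.

Advertising and marketing

Product showcase

Generate high-resolution product images for advertising and marketing materials.

Result: eye-catching product images that improve marketing impact.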

🚀 CogView3-Plus-3B

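CogView3-Plus-3B is a text-to-image generation model that produces images from 512 to 2048 px, with fast inference and flexible resolution settings.

📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv

📍 Visit Qingyan and the API platform to try larger-scale commercial video generation models.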

🚀 Quick Start

First, make sure the diffusers library is installed from source:

pip install git+https://github.com/huggingface/diffusers.git

Then run the following code:

from diffusers import CogView3PlusPipeline
import torch

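# Load in BF16: FP16 is not supported because it overflows and produces black images.
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16)

# Reduce GPU memory usage. CPU offload handles device placement itself,
# so the pipeline is not moved to "cuda" manually.
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()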

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview3.png")

For more information, and to download the original SAT weights, visit our GitHub.

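✨ Key Features

Flexible resolution: width and height must each be between 512 px and 2048 px and divisible by 32.
Fast inference: 1 s per step (tested on an A100).
Selectable precision: BF16 and FP32 are supported (FP16 is not, because it overflows and produces black images).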
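
The resolution rule and the per-step timing above can be checked before a pipeline is ever loaded. The following is a minimal sketch in plain Python; the helper names (`is_valid_resolution`, `snap_side`, `estimate_seconds`) are hypothetical and not part of diffusers, and the 1 s/step default only reflects the A100 figure quoted above, so other hardware will differ.

```python
# Limits taken from the feature list: 512..2048 px, divisible by 32.
MIN_SIDE = 512
MAX_SIDE = 2048
MULTIPLE = 32


def is_valid_resolution(width: int, height: int) -> bool:
    """Return True if both sides are within 512..2048 and divisible by 32."""
    return all(
        MIN_SIDE <= side <= MAX_SIDE and side % MULTIPLE == 0
        for side in (width, height)
    )


def snap_side(side: int) -> int:
    """Clamp a side to 512..2048, then round down to a multiple of 32."""
    clamped = max(MIN_SIDE, min(MAX_SIDE, side))
    return clamped - clamped % MULTIPLE


def estimate_seconds(num_inference_steps: int, seconds_per_step: float = 1.0) -> float:
    """Rough wall-clock estimate; 1.0 s/step is the quoted A100 figure."""
    return num_inference_steps * seconds_per_step


if __name__ == "__main__":
    print(is_valid_resolution(1024, 1024))
    print(snap_side(1000), snap_side(300), snap_side(3000))
    print(estimate_seconds(50))
```

With the quick-start setting of 50 inference steps, the estimate works out to roughly 50 seconds per image on an A100.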

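🔧 Technical Details

Inference Requirements and Model Overview

This model is the DiT version of CogView3, a text-to-image generation model that supports images from 512 to 2048 px.

Memory Consumption

We measured memory consumption on an A100 at several common resolutions with batch size 1 and BF16. The results are shown in the table below: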

Resolution	enable_model_cpu_offload OFF	enable_model_cpu_offload ON
512 * 512	19GB	11GB
720 * 480	20GB	11GB
1024 * 1024	23GB	11GB
1280 * 720	24GB	11GB
2048 * 2048	25GB	11GB
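
One way to use these measurements is to decide, for a given amount of free GPU memory, whether `enable_model_cpu_offload()` is needed. The sketch below simply mirrors the table; the function name `should_enable_offload` is hypothetical, the reading of each resolution as width × height is an assumption, and the numbers apply only to the measured setup (A100, batch size 1, BF16). Note that the table includes 720 × 480 and 1280 × 720 as measured, even though they fall outside the divisible-by-32 rule listed under the key features.

```python
# Peak memory in GB copied from the table above (A100, batch size 1, BF16).
# Keys are (width, height), assuming the table lists width first.
# Values are (offload_off_gb, offload_on_gb).
MEMORY_GB = {
    (512, 512): (19, 11),
    (720, 480): (20, 11),
    (1024, 1024): (23, 11),
    (1280, 720): (24, 11),
    (2048, 2048): (25, 11),
}


def should_enable_offload(width: int, height: int, available_gb: float) -> bool:
    """Return True if CPU offload is needed to fit, False if it is not.

    Raises KeyError for resolutions that were not measured and ValueError
    if even the offloaded footprint exceeds the available memory.
    """
    offload_off_gb, offload_on_gb = MEMORY_GB[(width, height)]
    if available_gb >= offload_off_gb:
        return False
    if available_gb >= offload_on_gb:
        return True
    raise ValueError(
        f"{width}x{height} needs about {offload_on_gb} GB even with offload, "
        f"but only {available_gb} GB is available"
    )


if __name__ == "__main__":
    print(should_enable_offload(1024, 1024, 24))
    print(should_enable_offload(1024, 1024, 16))
```

Offloading trades speed for memory, so a card that fits the non-offloaded footprint can skip it.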

📄 Citation

🌟 If you find our work helpful, please cite our paper and leave a star:

@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}