Stable Cascade open-source text-to-image model - fast inference, low-cost training, efficient image generation

Stable Cascade

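Developed by stabilityai

An efficient text-to-image model built on the Würstchen architecture; a compression factor of 42 enables fast inference and low-cost training

Text-to-image; License: Other; #EfficientImageGeneration #HighlyCompressedLatentSpace #MultiStageDiffusionModel

Downloads: 24.96k

Released: 2/6/2024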

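Model introduction

Stable Cascade is a three-stage text-to-image model. Working in a highly compressed latent space significantly lowers its compute cost while preserving high-quality image generation.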

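Model features

Efficient compression architecture

Uses a compression factor of 42 (1024x1024 → 24x24), a significant efficiency gain over the factor of 8 used by Stable Diffusion

Low-cost training

Earlier versions reduced training cost 16-fold relative to Stable Diffusion 1.5

Compatible with extensions

Supports extensions such as LoRA, ControlNet, IP-Adapter, and LCM

Multiple versions

Offers checkpoints at different parameter counts (1 billion, 3.6 billion, and others) to suit different needs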

Model capabilities

Text-to-image generation

High-resolution image generation (1024x1024)

Fast inference

Image reconstruction

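Use cases

Artistic creation

Concept art generation

Generate creative concept art from text descriptions

High-quality artwork

Design applications

Product prototyping

Quickly generate prototype images for product designs

Faster design workflow

Education and research

Generative model research

Study the architecture and performance of efficient generative models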

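🚀 Stable Cascade

Stable Cascade is a text-to-image generation model built on the Würstchen architecture. It works in a smaller latent space, which gives faster inference and cheaper training. It suits efficiency-sensitive use cases and supports the known extension methods.

🚀 Quick start

Using StableCascadeDecoderPipeline with the torch.bfloat16 data type requires PyTorch 2.2.0 or later. Because StableCascadeCombinedPipeline calls StableCascadeDecoderPipeline internally, it also requires PyTorch 2.2.0 or later when used with torch.bfloat16.

If your environment cannot install PyTorch 2.2.0 or later, StableCascadeDecoderPipeline can be used on its own with the torch.float16 data type. Download the full-precision or bf16 variant weights for the pipeline and cast them to torch.float16.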
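
The dtype rule above can be written as a small version check. This is a minimal sketch and not part of diffusers: `pick_decoder_dtype` is a hypothetical helper that takes a PyTorch version string and returns the name of the dtype to use. It assumes the string starts with `major.minor`, as `torch.__version__` does.

```python
def pick_decoder_dtype(torch_version: str) -> str:
    """Return the dtype name for the decoder pipeline (hypothetical helper)."""
    # Drop a local build suffix such as "+cu121", then read major and minor.
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    # bfloat16 needs PyTorch 2.2.0 or later; otherwise fall back to float16.
    return "bfloat16" if (major, minor) >= (2, 2) else "float16"


if __name__ == "__main__":
    for example_version in ("2.2.0+cu121", "2.1.2", "1.13.1"):
        print(example_version, "->", pick_decoder_dtype(example_version))
```

In a real script, the result could be turned into a dtype with `getattr(torch, pick_decoder_dtype(torch.__version__))` and passed as `torch_dtype` when loading the decoder.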

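✨ Key features

Efficient inference and training: Built on the Würstchen architecture and working in a smaller latent space, it offers faster inference and cheaper training than Stable Diffusion. Stable Diffusion uses a compression factor of 8, encoding a 1024x1024 image to 128x128, whereas Stable Cascade achieves a compression factor of 42, encoding a 1024x1024 image to 24x24 while keeping reconstructions crisp.
Support for many extensions: Supports all known extension methods, such as finetuning, LoRA, ControlNet, IP-Adapter, and LCM.
Strong performance: In almost every comparison, Stable Cascade performs best in both prompt alignment and aesthetic quality.

📦 Installation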

pip install diffusers

💻 Usage examples

Basic usage

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Advanced usage

Using the lite versions of the Stage B and Stage C models

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Loading original checkpoints with `from_single_file`

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
    torch_dtype=torch.bfloat16
)
decoder_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
    torch_dtype=torch.bfloat16
)

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade-single-file.png")

Using `StableCascadeCombinedPipeline`

import torch  # needed for torch.bfloat16 below
from diffusers import StableCascadeCombinedPipeline

pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=10,
    prior_num_inference_steps=20,
    prior_guidance_scale=3.0,
    width=1024,
    height=1024,
).images[0].save("cascade-combined.png")

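📚 Detailed documentation

Model details

Model description

Stable Cascade is a diffusion model trained to generate images from a text prompt.

Attribute	Details
Developed by	Stability AI
Funded by	Stability AI
Model type	Generative text-to-image model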

Model sources

For research purposes, we recommend the StableCascade GitHub repository (https://github.com/Stability-AI/StableCascade).

Repository: https://github.com/Stability-AI/StableCascade
Paper: https://openreview.net/forum?id=gU58d5QeGv

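Model overview

Stable Cascade consists of three models, Stage A, Stage B, and Stage C, which form a cascade for generating images; hence the name "Stable Cascade". Stage A and Stage B compress images, much as the VAE does in Stable Diffusion. With this setup, however, a much higher compression can be achieved. Stable Diffusion models use a spatial compression factor of 8, encoding a 1024 x 1024 image to 128 x 128, whereas Stable Cascade achieves a compression factor of 42, encoding a 1024 x 1024 image to 24 x 24 while still decoding it accurately. This greatly lowers the cost of both training and inference. Stage C, in turn, generates the small 24 x 24 latents from a text prompt.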
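
The latent sizes quoted above follow from dividing the image side by the compression factor. The snippet below is a small arithmetic sketch: `latent_side` is a hypothetical helper, and rounding down is an assumption (the text only gives the resulting 24 x 24 size). The final ratio counts spatial positions only; it ignores channel counts and model size, so it should not be read as a measured speedup.

```python
def latent_side(image_side: int, compression_factor: int) -> int:
    """Side length of the latent grid, rounding down (rounding is an assumption)."""
    return image_side // compression_factor


sd_side = latent_side(1024, 8)        # Stable Diffusion: 128
cascade_side = latent_side(1024, 42)  # Stable Cascade: 24

# Ratio of spatial positions in the two latent grids.
position_ratio = (sd_side ** 2) / (cascade_side ** 2)

print(sd_side, cascade_side, round(position_ratio, 1))  # 128 24 28.4
```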

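This release provides two checkpoints for Stage C, two for Stage B, and one for Stage A. Stage C comes in 1 billion and 3.6 billion parameter versions; we strongly recommend the 3.6 billion version, since most of the finetuning work went into it. The two Stage B versions have 700 million and 1.5 billion parameters. Both achieve great results, but the 1.5 billion version is better at reconstructing small details. Using the larger variant of each stage therefore gives the best results. Finally, Stage A contains 20 million parameters and is fixed due to its small size.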
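
To compare the footprint of the smaller and larger configurations, the parameter counts quoted above can be added up. This is an illustrative tally only: the dictionary layout and the `total_params_m` helper are hypothetical, and the sum covers just the three stages named in the text, not any other component a pipeline may load.

```python
# Parameter counts in millions, as quoted in the text above.
STAGE_PARAMS_M = {
    "stage_c": {"small": 1000, "large": 3600},
    "stage_b": {"small": 700, "large": 1500},
    "stage_a": {"small": 20, "large": 20},  # only one Stage A checkpoint exists
}


def total_params_m(size: str) -> int:
    """Sum the parameters (in millions) across the three stages for one size."""
    return sum(stage[size] for stage in STAGE_PARAMS_M.values())


print(total_params_m("small"), total_params_m("large"))  # 1720 5120
```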

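Evaluation

According to the evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost every comparison. The accompanying figure shows the results of a human evaluation that used a mix of parti-prompts and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps).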