Stable Cascade: Open-Source Text-to-Image Model with Fast Inference, Low-Cost Training, and Efficient Image Generation

Stable Cascade

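Developed by stabilityai

An efficient text-to-image model based on the Würstchen architecture; a compression factor of 42 enables fast inference and low-cost training

Text-to-Image | License: Other | #Efficient image generation #Highly compressed latent space #Multi-stage diffusion model

Downloads: 24.96k

Released: 2/6/2024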

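Model Overview

Stable Cascade is a three-stage text-to-image model. Its highly compressed latent space substantially lowers compute cost while preserving high-quality image generation.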

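Model Features

Efficient compression architecture

Uses a compression factor of 42 (1024x1024 → 24x24), a substantial efficiency gain over the factor of 8 used by Stable Diffusion

Low-cost training

An earlier version of the architecture reduced training cost 16x compared with Stable Diffusion 1.5

Compatible extensions

Supports extensions such as LoRA, ControlNet, IP-Adapter, and LCM

Multiple versions

Offers model versions at different parameter scales (1 billion, 3.6 billion, and others) to suit different needs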

Model Capabilities

Text-to-image generation

High-resolution image generation (1024x1024)

Fast inference

Image reconstruction

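Use Cases

Artistic creation

Concept art generation

Generates creative concept art from text descriptions

High-quality artwork

Design applications

Product prototyping

Quickly generates product design prototype images

Speeds up the design workflow

Education and research

Generative model research

Studies the architecture and performance of efficient generative models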

🚀 Stable Cascade

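Stable Cascade is a text-to-image generative model built on the Würstchen architecture. Because it operates in a much smaller latent space, it offers faster inference and cheaper training. It suits efficiency-sensitive use cases and supports the various known extension methods.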

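🚀 Quick Start

To use the torch.bfloat16 data type with StableCascadeDecoderPipeline, you need PyTorch 2.2.0 or higher. Because StableCascadeCombinedPipeline calls StableCascadeDecoderPipeline internally, using it with torch.bfloat16 also requires PyTorch 2.2.0 or higher.

If you cannot install PyTorch 2.2.0 or higher in your environment, StableCascadeDecoderPipeline can still be used on its own with the torch.float16 data type. Download the full-precision or bf16 variant weights for the pipeline and cast them to torch.float16.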
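The version rule above can be turned into a small check. The following is a minimal sketch, not part of diffusers: `parse_major_minor` and `decoder_dtype_name` are hypothetical helpers, and the only rule they encode is the one stated above (bfloat16 for the decoder needs PyTorch 2.2.0 or higher, otherwise fall back to float16).

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def decoder_dtype_name(torch_version: str) -> str:
    """Hypothetical helper: name of the dtype to use for the decoder.

    Returns 'bfloat16' when the PyTorch version is at least 2.2,
    and 'float16' otherwise.
    """
    if parse_major_minor(torch_version) >= (2, 2):
        return "bfloat16"
    return "float16"


# Plain version strings are used here so the sketch runs without PyTorch.
for version in ("2.2.0+cu121", "2.1.2", "1.13.1"):
    print(version, "->", decoder_dtype_name(version))
```

With PyTorch installed, the result could be resolved to a dtype with `getattr(torch, decoder_dtype_name(torch.__version__))` and passed as `torch_dtype` when loading the decoder.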

pip install diffusers

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

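✨ Key Features

Efficient inference and training: Built on the Würstchen architecture, the model operates in a much smaller latent space, so inference is faster and training is cheaper than with Stable Diffusion. Stable Diffusion uses a compression factor of 8, encoding a 1024x1024 image to 128x128; Stable Cascade achieves a compression factor of 42, encoding a 1024x1024 image to 24x24 while keeping reconstructions crisp.
Support for many extensions: All known extension methods are supported, including fine-tuning, LoRA, ControlNet, IP-Adapter, and LCM.
Strong performance: In nearly all comparisons, Stable Cascade performs best in both prompt alignment and aesthetic quality.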

📦 Installation

pip install diffusers

💻 Usage Examples

Advanced Usage

Using the lite versions of the Stage B and Stage C models

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Loading the original checkpoints with `from_single_file`

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
    torch_dtype=torch.bfloat16
)
decoder_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
    torch_dtype=torch.bfloat16
)

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade-single-file.png")

Using `StableCascadeCombinedPipeline`

import torch
from diffusers import StableCascadeCombinedPipeline

pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=10,
    prior_num_inference_steps=20,
    prior_guidance_scale=3.0,
    width=1024,
    height=1024,
).images[0].save("cascade-combined.png")
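Comparing this call with the two-pipeline examples suggests how the combined arguments divide between stages: `prior_`-prefixed arguments go to the prior, `num_inference_steps` goes to the decoder, and the prompt goes to both. The sketch below only illustrates that reading. `split_combined_kwargs` is a hypothetical helper, not a diffusers function, and it handles only the argument names that appear in the examples here.

```python
def split_combined_kwargs(combined: dict) -> tuple[dict, dict]:
    """Hypothetical helper: split combined-call arguments per stage.

    Assumed mapping, inferred from the examples in this document:
      - prompt / negative_prompt are passed to both stages
      - 'prior_*' arguments go to the prior with the prefix removed
      - height / width go to the prior
      - everything else goes to the decoder
    """
    shared = ("prompt", "negative_prompt")
    prior_only = ("height", "width")
    prior_kwargs: dict = {}
    decoder_kwargs: dict = {}
    for key, value in combined.items():
        if key in shared:
            prior_kwargs[key] = value
            decoder_kwargs[key] = value
        elif key.startswith("prior_"):
            prior_kwargs[key[len("prior_"):]] = value
        elif key in prior_only:
            prior_kwargs[key] = value
        else:
            decoder_kwargs[key] = value
    return prior_kwargs, decoder_kwargs


prior_kwargs, decoder_kwargs = split_combined_kwargs(
    {
        "prompt": "an image of a shiba inu, donning a spacesuit and helmet",
        "negative_prompt": "",
        "num_inference_steps": 10,
        "prior_num_inference_steps": 20,
        "prior_guidance_scale": 3.0,
        "width": 1024,
        "height": 1024,
    }
)
print("prior:  ", prior_kwargs)
print("decoder:", decoder_kwargs)
```

Read this way, the combined call corresponds to 20 prior steps and 10 decoder steps, the same step counts as the two-pipeline examples.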

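📚 Detailed Documentation

Model Details

Model Description

Stable Cascade is a diffusion model trained to generate images from a text prompt.

Attribute	Details
Developed by	Stability AI
Funded by	Stability AI
Model type	Generative text-to-image model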

Model Sources

For research purposes, we recommend the StableCascade GitHub repository (https://github.com/Stability-AI/StableCascade).

Repository: https://github.com/Stability-AI/StableCascade
Paper: https://openreview.net/forum?id=gU58d5QeGv

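Model Overview

Stable Cascade consists of three models, Stage A, Stage B, and Stage C, which form a cascade for generating images, hence the name "Stable Cascade". Stage A and Stage B compress images, much as the VAE does in Stable Diffusion. This setup, however, achieves far higher compression. Stable Diffusion models use a spatial compression factor of 8, encoding a 1024 x 1024 image to 128 x 128, whereas Stable Cascade achieves a compression factor of 42, encoding a 1024 x 1024 image to 24 x 24 while still decoding it accurately. The result is a substantial reduction in training and inference cost. Stage C, in turn, generates the small 24 x 24 latents from a text prompt.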
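The compression figures can be checked with a few lines of arithmetic. This sketch uses only the resolutions quoted above and counts spatial positions only; it assumes nothing about the number of channels per latent position, which is not stated here.

```python
def spatial_compression(image_side: int, latent_side: int) -> float:
    """Per-side compression factor: image pixels per latent position along one axis."""
    return image_side / latent_side


def latent_positions(latent_side: int) -> int:
    """Number of spatial positions in a square latent grid."""
    return latent_side * latent_side


sd_factor = spatial_compression(1024, 128)      # Stable Diffusion: 1024 -> 128
cascade_factor = spatial_compression(1024, 24)  # Stable Cascade: 1024 -> 24
position_ratio = latent_positions(128) / latent_positions(24)

print(f"Stable Diffusion factor: {sd_factor:.2f}")
print(f"Stable Cascade factor:   {cascade_factor:.2f}")
print(
    f"Spatial positions: {latent_positions(128)} vs {latent_positions(24)} "
    f"({position_ratio:.1f}x fewer)"
)
```

The quoted factor of 42 is 1024 / 24 ≈ 42.67 rounded down, and the 24 x 24 grid has roughly 28 times fewer spatial positions than the 128 x 128 grid.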

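This release provides two checkpoints for Stage C, two for Stage B, and one for Stage A. Stage C comes in 1 billion and 3.6 billion parameter versions; we strongly recommend the 3.6 billion version, since most of the fine-tuning work went into it. The two Stage B versions have 700 million and 1.5 billion parameters. Both give good results, but the 1.5 billion version is better at reconstructing fine details. The larger variant of each stage therefore gives the best results. Finally, Stage A has 20 million parameters and, because of its small size, is kept fixed.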
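When choosing between variants, a rough estimate of weight storage can help. The sketch below multiplies the parameter counts quoted above by the bytes per parameter (2 for bf16 or fp16, 4 for fp32). It is a back-of-the-envelope figure for the weights alone and assumes nothing about activations, the text encoder, or other runtime memory; the labels "large" and "small" are used here only for illustration.

```python
# Parameter counts as quoted in the text above.
PARAMS = {
    "Stage C (large)": 3.6e9,
    "Stage C (small)": 1.0e9,
    "Stage B (large)": 1.5e9,
    "Stage B (small)": 0.7e9,
    "Stage A": 20e6,
}


def weight_gigabytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    The default of 2 bytes per parameter corresponds to bf16 or fp16.
    """
    return num_params * bytes_per_param / 1e9


for name, count in PARAMS.items():
    print(
        f"{name}: ~{weight_gigabytes(count):.2f} GB in 16-bit, "
        f"~{weight_gigabytes(count, 4):.2f} GB in 32-bit"
    )
```

By this estimate the 3.6 billion parameter Stage C needs about 7.2 GB for 16-bit weights, against about 2 GB for the 1 billion parameter version.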

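Evaluation

According to the evaluation, Stable Cascade performs best in nearly all comparisons in both prompt alignment and aesthetic quality. The accompanying figure shows the results of a human evaluation using a mix of parti-prompts and aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps).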