Stable Diffusion 3.5 Large Open-Source Image Generation Model


Stable Diffusion 3.5 Large

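Developed by stabilityai

A text-to-image model built on the Multimodal Diffusion Transformer architecture, with significant improvements in image quality, typography, and complex prompt understanding

Text-to-Image · English · Open-source license: Other · #MultimodalDiffusionTransformer #HighPrecisionTextToImage #ComplexTypographySupport

Downloads: 143.20k

Release date: 10/22/2024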

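Model Overview

Generates high-quality images from text prompts; suited to scenarios such as creative design and educational tool development

Model Features

Multimodal Diffusion Transformer Architecture

Uses the MMDiT architecture with three fixed, pretrained text encoders to improve image generation quality

QK Normalization

Improves training stability and model performance

Multiple Text Encoder Support

Supports CLIP-family and T5 text encoders for stronger text understanding

Resource Efficiency Optimization

Offers a quantized deployment option that reduces VRAM usage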

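Model Capabilities

Text-to-image generation

Complex prompt understanding

High-quality image generation

Improved typography

Use Cases

Creative Design

Art creation

Generate artwork from text descriptions

High-quality artistic images

Design assistance

Provide creative inspiration for designers

Diverse design concepts

Educational Tools

Educational content generation

Generate image content for educational tools

Rich educational materials

Research

Generative model research

Research on text-to-image generation models

Advanced model architecture and techniques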

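🚀 Stable Diffusion 3.5 Large

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that performs strongly in image quality, typography, complex prompt understanding, and resource efficiency, generating high-quality images from text prompts.

🚀 Quick Start

Install dependencies

Upgrade to the latest version of the 🧨 diffusers library

pip install -U diffusers

Run the example code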

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")
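The `guidance_scale` argument controls classifier-free guidance. As a minimal sketch, assuming the standard classifier-free guidance formulation (which the diffusers SD3 pipeline applies when `guidance_scale` is greater than 1), each denoising step blends an unconditional prediction with a prompt-conditioned one. Plain Python lists stand in for tensors, and `apply_cfg` is a hypothetical helper, not a diffusers function.

```python
from typing import List


def apply_cfg(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: guided = uncond + scale * (cond - uncond).

    A scale of 1.0 returns the conditional prediction unchanged; larger values
    push the result further in the direction the prompt suggests.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy values standing in for per-element model predictions (hypothetical).
uncond_pred = [0.0, 1.0, -0.5]
cond_pred = [0.5, 1.0, 0.5]

print(apply_cfg(uncond_pred, cond_pred, 1.0))  # equals cond_pred
print(apply_cfg(uncond_pred, cond_pred, 3.5))  # scale used in the quick-start example
```

Elements where the two predictions agree (the middle value) are untouched, while disagreements are amplified in proportion to the scale.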

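✨ Key Features

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model with improved performance in image quality, typography, complex prompt understanding, and resource efficiency.

📦 Installation Guide

Install the diffusers library

pip install -U diffusers

Install the bitsandbytes library (used for model quantization)

pip install bitsandbytes

💻 Usage Examples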


Advanced usage

Model quantization

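from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch

model_id = "stabilityai/stable-diffusion-3.5-large"

# 4-bit NF4 quantization config (requires bitsandbytes); compute still runs in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load only the MMDiT transformer in quantized form.
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

# Build the full pipeline around the quantized transformer.
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Keep components on the CPU and move each to the GPU only while it runs, further reducing VRAM use.
pipeline.enable_model_cpu_offload()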

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")
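To see why 4-bit NF4 quantization reduces VRAM usage, a rough estimate of weight storage helps. This sketch assumes about 8.1 billion transformer parameters (the figure Stability AI cited publicly for SD3.5 Large; it is not stated in this card) and counts only raw weight bytes, ignoring activations, the text encoders, the VAE, and per-block quantization constants, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
# Approximate bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "nf4": 0.5,  # 4 bits per weight; ignores the small per-block scale overhead
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Raw weight storage in GiB for num_params parameters stored as dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: roughly 8.1e9 parameters in the SD3.5 Large transformer.
TRANSFORMER_PARAMS = 8.1e9

for dtype in ("bf16", "nf4"):
    print(f"{dtype}: {weight_memory_gib(TRANSFORMER_PARAMS, dtype):.1f} GiB")
```

The useful takeaway is the roughly 4x ratio between bf16 and NF4 weights; the absolute numbers are only a lower bound on actual VRAM use.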

Fine-tuning

Please refer to the fine-tuning guide.

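📚 Detailed Documentation

Model Description

Developed by: Stability AI
Model type: MMDiT text-to-image generative model
Model description: This model generates images from text prompts. It is a Multimodal Diffusion Transformer that uses three fixed, pretrained text encoders, with QK normalization to improve training stability.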
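QK normalization normalizes the query and key vectors before their dot product, so attention logits cannot grow without bound when activations drift during training. A minimal pure-Python sketch, assuming an RMSNorm-style normalization (the variant the SD3 model line is generally described as using; treat this as an assumption) and omitting the learnable gain for brevity. `rms_norm` and `attention_logit` are hypothetical helpers, not library functions.

```python
import math
from typing import List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x so its root-mean-square is about 1 (learnable gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def attention_logit(q: List[float], k: List[float], qk_norm: bool) -> float:
    """Scaled dot-product logit q·k / sqrt(d), optionally with QK normalization."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    d = len(q)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(d)


q = [3.0, 4.0, 0.0, 0.0]
for scale in (1.0, 100.0):
    big = [scale * v for v in q]  # simulate activations growing during training
    print(scale,
          attention_logit(big, big, qk_norm=False),  # grows with scale**2
          attention_logit(big, big, qk_norm=True))   # stays near 2.0
```

With normalization, each vector has norm of about sqrt(d), so the logit is bounded by sqrt(d) (2.0 here) however large the raw activations become; without it, a 100x growth in activations inflates the logit 10,000x.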

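License

Community License: Free for research, non-commercial use, and commercial use by organizations or individuals with less than US$1 million in annual revenue. See the Community License Agreement for details. Visit Stability AI to learn more, or contact us for commercial licensing details.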

Model Sources

ComfyUI: GitHub, example workflow
Hugging Face Space: Space
Diffusers: see below
GitHub: GitHub
API endpoints:

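Model Performance

See the blog for our comparative studies of prompt adherence and aesthetic quality.

File Structure

The files are listed under the repository's Files and versions tab.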

│
├── text_encoders/  
│   ├── README.md
│   ├── clip_g.safetensors
│   ├── clip_l.safetensors
│   ├── t5xxl_fp16.safetensors
│   └── t5xxl_fp8_e4m3fn.safetensors
│
├── README.md
├── LICENSE
├── sd3_large.safetensors
├── SD3.5L_example_workflow.json
└── sd3_large_demo.png

** The following file structure is used for the diffusers integration **
├── scheduler/
├── text_encoder/
├── text_encoder_2/
├── text_encoder_3/
├── tokenizer/
├── tokenizer_2/
├── tokenizer_3/
├── transformer/
├── vae/
└── model_index.json
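In the diffusers layout, `model_index.json` tells `StableDiffusion3Pipeline.from_pretrained` which library and class to load from each subfolder listed above. The sketch below reads such a file and lists its components. The JSON excerpt is an assumption modeled on typical SD3 pipelines, not a copy of this repository's file, so check the actual file before relying on the class names; `list_components` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple

# Assumed excerpt of model_index.json; the real file may differ.
EXAMPLE_INDEX = {
    "_class_name": "StableDiffusion3Pipeline",
    "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"],
    "text_encoder": ["transformers", "CLIPTextModelWithProjection"],
    "text_encoder_2": ["transformers", "CLIPTextModelWithProjection"],
    "text_encoder_3": ["transformers", "T5EncoderModel"],
    "tokenizer": ["transformers", "CLIPTokenizer"],
    "tokenizer_2": ["transformers", "CLIPTokenizer"],
    "tokenizer_3": ["transformers", "T5TokenizerFast"],
    "transformer": ["diffusers", "SD3Transformer2DModel"],
    "vae": ["diffusers", "AutoencoderKL"],
}


def list_components(index_path: Path) -> Dict[str, Tuple[str, str]]:
    """Map each subfolder name to its (library, class) pair.

    Keys starting with an underscore (e.g. _class_name) are metadata, not subfolders.
    """
    data = json.loads(index_path.read_text())
    return {
        name: (spec[0], spec[1])
        for name, spec in data.items()
        if not name.startswith("_") and isinstance(spec, list)
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_index.json"
    path.write_text(json.dumps(EXAMPLE_INDEX))
    for folder, (library, cls) in sorted(list_components(path).items()):
        print(f"{folder:15} {library:13} {cls}")
```

The nine entries correspond one-to-one with the nine subfolders in the diffusers file structure.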

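Usage

Intended Uses

Intended uses include the following:

Generating artworks for use in design and other artistic processes.
Applications in educational or creative tools.
Research on generative models, including understanding their limitations.

All uses of the model must comply with our Acceptable Use Policy.

Out-of-Scope Uses

The model is not designed to produce factual or true representations of people or events, so using it to generate such content is outside its capabilities.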

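Safety

As part of our safety-by-design and responsible AI deployment approach, we took deliberate measures to ensure model integrity from the early stages of development, and we implemented safety measures throughout the model development process. We have implemented safety mitigations intended to reduce the risk of certain harms; however, we recommend that developers conduct their own testing and apply additional mitigations based on their specific use cases. To learn more about our approach to safety, please visit our Safety page.

Integrity Evaluation

Our integrity evaluation methods include structured evaluations and red-teaming for certain harms. Testing was conducted primarily in English and may not cover all possible harms.

Identified Risks and Mitigations

Harmful content: We trained the model on filtered datasets and implemented safeguards that attempt to strike the right balance between usefulness and harm prevention. However, this does not guarantee that all possible harmful content has been removed. All developers and deployers should exercise caution and implement content safety guardrails based on their specific product policies and application use cases.
Misuse: Technical limitations, together with developer and end-user education, can help mitigate malicious applications of the model. All users are required to comply with our Acceptable Use Policy, including when applying fine-tuning and prompt engineering mechanisms. Please refer to the Stability AI Acceptable Use Policy for information on prohibited uses of our products.
Privacy violations: Developers and deployers are encouraged to comply with privacy regulations and to adopt techniques that respect data privacy.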

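Contact Us

Please report any issues with the model or contact us:

Safety issues: safety@stability.ai
Security issues: security@stability.ai
Privacy issues: privacy@stability.ai
License and general inquiries: https://stability.ai/license
Enterprise license: https://stability.ai/enterprise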

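🔧 Technical Details

Implementation Details

QK normalization: Implements QK normalization to improve training stability.
Text encoders:
- CLIPs: OpenCLIP-ViT/G and CLIP-ViT/L, context length of 77 tokens
- T5: T5-xxl, context length of 77/256 tokens at different stages of training
Training data and strategy: The model was trained on a wide variety of data, including synthetic data and filtered publicly available data.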
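The CLIP and T5 context lengths determine how long the text-conditioning sequence fed to the transformer is. As a shape-only sketch, assuming the scheme described in the SD3 paper (CLIP-L and OpenCLIP-G token embeddings concatenated along the channel axis, zero-padded to the T5-XXL width, then stacked with the T5 tokens along the sequence axis), the arithmetic looks like this. The hidden sizes 768, 1280, and 4096 are standard for these encoders but are not stated in this card, and the assumption that the diffusers pipeline pads T5 to `max_sequence_length` is likewise unverified here; `conditioning_shape` is a hypothetical helper.

```python
from typing import Tuple

# Assumed token-embedding widths of each text encoder (not stated in this card).
CLIP_L_DIM = 768    # CLIP-ViT/L
CLIP_G_DIM = 1280   # OpenCLIP-ViT/G
T5_XXL_DIM = 4096   # T5-xxl
CLIP_CONTEXT = 77   # CLIP context length, as listed above


def conditioning_shape(t5_tokens: int) -> Tuple[int, int]:
    """Return (sequence_length, channels) of the joint text conditioning.

    The two CLIP outputs are concatenated channel-wise (768 + 1280 = 2048 channels)
    and zero-padded to the T5 width, so the CLIP and T5 token sequences can be
    stacked along the sequence axis.
    """
    clip_channels = CLIP_L_DIM + CLIP_G_DIM
    pad = T5_XXL_DIM - clip_channels  # zeros appended to each CLIP token vector
    assert pad >= 0, "padding scheme assumes T5 is at least as wide as the CLIP concat"
    return CLIP_CONTEXT + t5_tokens, clip_channels + pad


print(conditioning_shape(256))  # (333, 4096): T5 context of 256 tokens
print(conditioning_shape(512))  # (589, 4096): max_sequence_length=512, as in the quantization example
```

Because the CLIP branch is capped at 77 tokens, a long prompt like the waffle-hippo example relies on the T5 branch (and a larger `max_sequence_length`) to carry its later details.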