Stable Diffusion 3.5 Large: open, free, high-quality text-to-image generation with markedly improved results

Stable Diffusion V3 5 Large GGUF

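Developed by gpustack

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model with improved image quality, typography, complex prompt understanding, and resource efficiency.

Text-to-image | English | License: other | #Multimodal diffusion transformer #High-fidelity text-to-image #Complex prompt understanding

Downloads: 13.33k

Release date: 11/11/2024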

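Model Overview

A text-to-image model built on the Multimodal Diffusion Transformer architecture, supporting high-quality image generation and complex text understanding.

Model Features

Multimodal Diffusion Transformer architecture

Uses the MMDiT architecture together with multiple pretrained text encoders to improve image generation quality.

QK normalization

Applies QK normalization to improve training stability.

Multiple text encoders

Integrates three text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl) to strengthen text understanding.

Efficient resource use

Offers several quantization options so the model can run efficiently on different hardware configurations.

Model Capabilities

Text-to-image generation

Complex prompt understanding

High-quality image synthesis

Typography generation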

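Use Cases

Art creation

Concept art

Create concept art and design assets for games, film, and other media

Generate high-quality artwork in a specific style and theme

Illustration

Generate illustrations automatically from text descriptions

Quickly produce visual content that fits the requirements

Design and marketing

Advertising assets

Quickly generate visual assets for marketing campaigns

Improve creative throughput and reduce production costs

Education and research

Generative model research

Study the behavior and limitations of diffusion models

Advance generative AI technology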

🚀 stable-diffusion-v3-5-large-GGUF

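stable-diffusion-v3-5-large-GGUF is a text-to-image generation model: a GGUF quantization of Stable Diffusion 3.5 Large that turns text descriptions into images, with strong image quality, typography, complex prompt understanding, and resource efficiency.

🚀 Quick Start

This model is experimentally supported by gpustack/llama-box v0.0.75+.

✨ Key Features

Multimodal Diffusion Transformer architecture: uses the MMDiT architecture with three fixed, pretrained text encoders and QK normalization to improve training stability.
Multiple quantization options: available in FP16, Q8_0, Q4_1, and Q4_0 to suit different scenarios.
Broad range of applications: suitable for art creation, education, creative tools, and generative model research.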
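
To make the trade-off between the listed formats concrete, the following minimal sketch estimates bits per weight and an approximate weight size for each one. The block layouts (32 weights per block, an fp16 scale, and for Q4_1 an additional fp16 minimum) follow the usual ggml definitions. The parameter count `ASSUMED_PARAMS` is an assumption that does not come from this page, and real files may keep some tensors at higher precision, so treat the sizes as rough estimates only.

```python
# Rough size estimate for the formats listed above (a sketch, not a measurement).
BLOCK_SIZE = 32  # weights per quantization block in the ggml block formats

# Bytes needed to store one block of 32 weights.
BLOCK_BYTES = {
    "FP16": 32 * 2,      # plain half precision: 2 bytes per weight
    "Q8_0": 32 + 2,      # 32 int8 values + one fp16 scale
    "Q4_1": 16 + 2 + 2,  # 32 4-bit values + fp16 scale + fp16 minimum
    "Q4_0": 16 + 2,      # 32 4-bit values + one fp16 scale
}


def bits_per_weight(fmt: str) -> float:
    """Average number of bits stored per weight for a format."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_SIZE


def estimated_size_gib(n_params: float, fmt: str) -> float:
    """Approximate size in GiB if every weight used the given format."""
    return n_params * bits_per_weight(fmt) / 8 / 2**30


ASSUMED_PARAMS = 8.1e9  # assumption: not stated anywhere in this document

for fmt in BLOCK_BYTES:
    bpw = bits_per_weight(fmt)
    size = estimated_size_gib(ASSUMED_PARAMS, fmt)
    print(f"{fmt}: {bpw:.1f} bits/weight, roughly {size:.1f} GiB")
```

Lower bits per weight reduce memory use at some cost in fidelity, so the choice depends on the available hardware.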

📦 Installation

Upgrade to the latest version of the diffusers library

pip install -U diffusers

Install bitsandbytes for model quantization

pip install bitsandbytes
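
After installing, it can help to confirm which versions are actually present in the active environment. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and `torch` is included only because the usage examples import it.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages that the examples in this document rely on.
for name, version in installed_versions(["diffusers", "bitsandbytes", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```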

💻 Usage Examples

Basic usage

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")

Advanced usage

from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch

model_id = "stabilityai/stable-diffusion-3.5-large"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id, 
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")

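📚 Detailed Documentation

Model Description

Attribute	Details
Model developer	Stability AI
Model type	MMDiT text-to-image generation model
Model description	The model generates images from text prompts. It is a Multimodal Diffusion Transformer that uses three fixed, pretrained text encoders and QK normalization to improve training stability.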
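
As an illustration of why normalizing queries and keys can help stability, here is a minimal pure-Python sketch of attention weights with and without QK normalization. It assumes RMS normalization without a learnable gain, which is one common form; this page does not specify the exact variant the model uses, and the vectors are made-up values chosen to have large magnitudes.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale a vector to unit root-mean-square (no learnable gain here)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, qk_norm=True):
    """Attention weights of one query over several keys."""
    if qk_norm:
        query = rms_norm(query)
        keys = [rms_norm(k) for k in keys]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(logits)


# Illustrative vectors with large activations (hypothetical values).
q = [40.0, -30.0, 20.0, 10.0]
ks = [
    [35.0, -25.0, 15.0, 5.0],
    [30.0, -20.0, 25.0, 10.0],
    [-10.0, 5.0, 0.0, 20.0],
]

print("without QK norm:", attention_weights(q, ks, qk_norm=False))
print("with QK norm:   ", attention_weights(q, ks, qk_norm=True))
```

With normalization, each logit is bounded in magnitude by the square root of the head dimension, so growth in activation scale during training cannot push the softmax into saturation.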

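License

Community License: applies to research, non-commercial use, and organizations or individuals with total annual revenue under US$1 million. For details, see the Community License Agreement, or visit https://stability.ai/license to learn more.
Enterprise License: individuals or organizations with total annual revenue above US$1 million should contact us to obtain an Enterprise License.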

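Model Sources

ComfyUI: GitHub, example workflow
Hugging Face Space: Space
Diffusers: see the usage examples above
GitHub: GitHub
API endpoints:

File Structure

The files are listed under the Files and versions tab.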

│
├── text_encoders/  
│   ├── README.md
│   ├── clip_g.safetensors
│   ├── clip_l.safetensors
│   ├── t5xxl_fp16.safetensors
│   └── t5xxl_fp8_e4m3fn.safetensors
│
├── README.md
├── LICENSE
├── sd3_large.safetensors
├── SD3.5L_example_workflow.json
└── sd3_large_demo.png

** The following file structure is used for the diffusers integration **
├── scheduler/
├── text_encoder/
├── text_encoder_2/
├── text_encoder_3/
├── tokenizer/
├── tokenizer_2/
├── tokenizer_3/
├── transformer/
├── vae/
└── model_index.json
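
When working from a local copy, a quick check that the diffusers layout above is complete can save a failed load. The following is a minimal sketch using only the standard library. The helper name `missing_entries` is hypothetical, and the temporary directory stands in for a real download location.

```python
import tempfile
from pathlib import Path

# Entries taken from the diffusers layout listed above.
EXPECTED_DIRS = [
    "scheduler", "text_encoder", "text_encoder_2", "text_encoder_3",
    "tokenizer", "tokenizer_2", "tokenizer_3", "transformer", "vae",
]
EXPECTED_FILES = ["model_index.json"]


def missing_entries(root):
    """Return the expected sub-folders and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Demo on a throwaway directory that contains only the vae/ folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vae").mkdir()
    print("missing:", missing_entries(tmp))
```

An empty list means every expected entry is present; it does not verify the contents of the folders.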

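Fine-tuning

See the fine-tuning guide.

Intended Uses

Generating artwork, and use in design and other artistic processes.
Applications in educational or creative tools.
Research on generative models, including understanding their limitations.

All uses of the model must comply with our Acceptable Use Policy.

Out-of-Scope Uses

The model is not intended to produce content that is factual or that truthfully represents people or events. Using the model to generate such content is therefore beyond its capabilities.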

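Safety

We took a range of measures to make the model safe and reliable. Safety measures were applied at every stage of model development to reduce potential risks. Even so, we recommend that developers run their own tests for their specific use case and apply additional mitigations.

Integrity Evaluation

Our integrity evaluation methods include structured evaluations and red-team testing for specific harms. Testing was conducted mainly in English and may not cover all possible harms.

Identified Risks and Mitigations

Harmful content: We trained the model on filtered datasets and implemented safeguards that try to balance usefulness with harm prevention. This does not guarantee that all possible harmful content has been removed. All developers and deployers should exercise caution and implement content safety guardrails based on their specific product policies and use cases.
Misuse: Technical limitations, together with developer and end-user education, can help mitigate malicious applications of the model. All users must follow our Acceptable Use Policy, including when applying fine-tuning and prompt engineering mechanisms. See the Stability AI Acceptable Use Policy for information on prohibited uses of our products.
Privacy violations: Developers and deployers are encouraged to use techniques that respect data privacy and to comply with privacy regulations.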

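Contact Us

Please report any issues with the model or contact us:

Safety issues: safety@stability.ai
Security issues: security@stability.ai
Privacy issues: privacy@stability.ai
License and general questions: https://stability.ai/license
Enterprise license: https://stability.ai/enterprise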

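🔧 Technical Details

Implementation Details

QK normalization: implements QK normalization to improve training stability.
Text encoders:
- CLIPs: OpenCLIP-ViT/G and CLIP-ViT/L, with a context length of 77 tokens
- T5: T5-xxl, with a context length of 77/256 tokens at different stages of training
Training data and strategy: the model was trained on a wide variety of data, including synthetic data and filtered publicly available data.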
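
The context lengths listed above mean that a long prompt is seen differently by each encoder family. The sketch below shows the effect on token counts only. It is a simplification under stated assumptions: the helper `tokens_kept` is hypothetical, each tokenizer would produce a different count for the same text, and special tokens that also consume context positions are ignored. The 512 value mirrors the `max_sequence_length` argument from the advanced usage example.

```python
# Context length for the CLIP encoders, from the list above.
CLIP_CONTEXT = 77


def tokens_kept(n_prompt_tokens, t5_max_length=256):
    """Tokens each encoder family would keep after truncation (simplified)."""
    return {
        "clip": min(n_prompt_tokens, CLIP_CONTEXT),
        "t5": min(n_prompt_tokens, t5_max_length),
    }


# A hypothetical 300-token prompt under two T5 length settings.
for t5_len in (256, 512):
    print(f"T5 limit {t5_len}:", tokens_kept(300, t5_max_length=t5_len))
```

In practice this suggests putting the most important parts of a long prompt first, since the CLIP encoders only see the beginning.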