Stable Diffusion 3.5 Large: an openly released text-to-image model, free to use under the community license, with markedly improved image quality


Stable Diffusion V3 5 Large GGUF

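GGUF conversion by gpustack; base model developed by Stability AI

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model with marked improvements in image quality, typography, complex prompt understanding, and resource efficiency.

Task: text-to-image. Language: English. License: other. Tags: #multimodal diffusion transformer, #high-fidelity text-to-image, #complex prompt understanding

Downloads: 13.33k

Released: 11/11/2024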

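Model overview

A text-to-image model built on the Multimodal Diffusion Transformer architecture, supporting high-quality image generation and complex text understanding.

Model features

Multimodal Diffusion Transformer architecture

Uses the MMDiT architecture together with several pretrained text encoders to improve image generation quality.

QK normalization

Applies QK normalization to improve training stability.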
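To make the QK normalization idea concrete, here is a minimal pure-Python sketch. It assumes the normalization is an RMS norm without a learned scale, applied to each query and key vector before the dot product. The page does not specify the exact variant, so the function names and that choice are illustrative, not the model's implementation.

```python
import math

def rms_normalize(vec, eps=1e-6):
    """Rescale a vector so that its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for a single query/key pair."""
    if qk_norm:
        # Assumed variant: normalize both q and k before the dot product.
        q, k = rms_normalize(q), rms_normalize(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.5, 1.0]
big_q = [1000.0 * x for x in q]  # simulate activations that have grown large

plain_small = attention_logit(q, k, qk_norm=False)
plain_big = attention_logit(big_q, k, qk_norm=False)
normed_small = attention_logit(q, k, qk_norm=True)
normed_big = attention_logit(big_q, k, qk_norm=True)
```

Scaling the query by 1000 multiplies the plain logit by 1000, while the normalized logit stays essentially unchanged and can never exceed the square root of the vector length in magnitude. Bounded logits of this kind are the usual explanation for why QK normalization helps training stability.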

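Multiple text encoders

Combines three text encoders, OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl, to strengthen text understanding.

Efficient resource use

Ships in several quantization options so it can run efficiently on different hardware configurations.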
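As a rough guide to what the quantization options mean for hardware, the sketch below estimates weight storage from bits per weight for the FP16, Q8_0, Q4_1, and Q4_0 formats this release lists. The bits-per-weight figures are assumptions taken from the ggml block layouts (32-weight blocks with a 16-bit scale, plus a 16-bit minimum for Q4_1), not from this page. The 1-billion parameter count is a placeholder, not the model's real size.

```python
# Assumed storage cost per weight, in bits, for each format.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,  # 32 x 8-bit codes + one 16-bit scale per block
    "Q4_1": 5.0,  # 32 x 4-bit codes + 16-bit scale + 16-bit minimum
    "Q4_0": 4.5,  # 32 x 4-bit codes + one 16-bit scale per block
}

def estimated_gigabytes(n_params, fmt):
    """Approximate weight storage in GB (10^9 bytes) for n_params weights."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

PLACEHOLDER_PARAMS = 1_000_000_000  # illustrative only, not the real model size
estimates = {fmt: estimated_gigabytes(PLACEHOLDER_PARAMS, fmt) for fmt in BITS_PER_WEIGHT}
```

Real files will differ somewhat from such an estimate, since a GGUF file also carries metadata and may keep some tensors at a higher precision.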

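Model capabilities

Text-to-image generation

Complex prompt understanding

High-quality image synthesis

Typography (rendering text in images)

Use cases

Artistic creation

Concept art

Create concept art and design assets for games, film, and other media

Generate high-quality artwork in a specific style and theme

Illustration

Generate illustrations automatically from text descriptions

Quickly produce visual content that fits a brief

Design and marketing

Advertising assets

Quickly generate visual assets for marketing campaigns

Raise creative throughput and lower production costs

Education and research

Generative model research

Study the behavior and limitations of diffusion models

Advance generative AI technology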

🚀 stable-diffusion-v3-5-large-GGUF

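stable-diffusion-v3-5-large-GGUF is a text-to-image generation model: a GGUF quantization of Stable Diffusion 3.5 Large that turns text descriptions into images, with strong image quality, typography, complex prompt understanding, and resource efficiency.

🚀 Quick start

This model is experimentally supported by gpustack/llama-box v0.0.75+.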
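The only requirement stated here is a minimum llama-box version. A minimal sketch of checking it follows, assuming release tags use the `vMAJOR.MINOR.PATCH` pattern seen in `v0.0.75`. `parse_version` and `supports_model` are hypothetical helpers. How to obtain the installed tag is left out because the page does not describe it.

```python
def parse_version(tag):
    """Turn 'v0.0.75' or '0.0.75' into a comparable tuple such as (0, 0, 75)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

# Minimum version stated on this page.
MIN_LLAMA_BOX = parse_version("v0.0.75")

def supports_model(installed_tag):
    """True if the given llama-box tag meets the stated minimum."""
    return parse_version(installed_tag) >= MIN_LLAMA_BOX

checks = {tag: supports_model(tag) for tag in ("v0.0.74", "v0.0.75", "v0.0.100")}
```

Comparing integer tuples rather than strings matters here: as plain strings, "v0.0.100" would sort before "v0.0.75".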

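✨ Key features

Multimodal Diffusion Transformer architecture: uses the MMDiT architecture with three fixed, pretrained text encoders, plus QK normalization to improve training stability.
Multiple precision and quantization options: available in FP16, Q8_0, Q4_1, Q4_0, and other formats to suit different scenarios.
Broad range of applications: suitable for artistic creation, education, creative tools, and generative model research.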
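To give a feel for what a format such as Q8_0 does to the weights, here is a pure-Python sketch of block-wise 8-bit quantization. It assumes the scheme used by ggml for Q8_0: blocks of 32 weights, one scale per block equal to the largest magnitude divided by 127, and codes rounded to the nearest integer. This page does not describe the scheme, and the sketch keeps the scale as a Python float rather than a 16-bit value, so it approximates the real format.

```python
BLOCK_SIZE = 32  # weights per block (assumed, as in ggml's Q8_0)

def round_half_away(x):
    """Round to the nearest integer, ties away from zero (like C's roundf)."""
    return int(x + 0.5) if x >= 0 else -int(-x + 0.5)

def quantize_q8_0(block):
    """Return (scale, codes) for one block of floats; codes lie in [-127, 127]."""
    amax = max(abs(v) for v in block)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0  # an all-zero block has scale 0
    return scale, [round_half_away(v * inv) for v in block]

def dequantize_q8_0(scale, codes):
    """Reconstruct approximate floats from the scale and the integer codes."""
    return [scale * c for c in codes]

weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]  # toy block in [-1.6, 1.5]
scale, codes = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per weight is at most half the block's scale, which is why a block containing a single large outlier loses more precision on its small values.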

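📦 Installation

Upgrade to the latest version of the diffusers library. Note that the examples in the usage section load the original stabilityai/stable-diffusion-3.5-large weights through diffusers, not the GGUF files.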

pip install -U diffusers

Install bitsandbytes for model quantization

pip install bitsandbytes
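After installing, it can help to confirm which versions ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical, and including `torch` in the default list is an assumption based on the usage examples importing it.

```python
from importlib import metadata

def installed_versions(packages=("diffusers", "bitsandbytes", "torch")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions()
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```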

💻 Usage examples

Basic usage

import torch
from diffusers import StableDiffusion3Pipeline

# Load the unquantized pipeline in bfloat16 and move it to the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate a single image from the prompt and save it to disk.
image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")

Advanced usage (transformer quantized to 4-bit NF4 with bitsandbytes)

from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from diffusers import StableDiffusion3Pipeline
import torch

model_id = "stabilityai/stable-diffusion-3.5-large"

# 4-bit NF4 quantization settings; computation still runs in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load only the transformer with the quantization config applied.
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

# Build the pipeline around the quantized transformer.
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Keep components on the CPU and move each to the GPU only while it is in use.
pipeline.enable_model_cpu_offload()

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")

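📚 Detailed documentation

Model description

Property	Details
Developed by	Stability AI
Model type	MMDiT text-to-image generation model
Model description	The model generates images from text prompts. It is a Multimodal Diffusion Transformer that uses three fixed, pretrained text encoders and applies QK normalization to improve training stability.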

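License

Community License: covers research, non-commercial use, and organizations or individuals with total annual revenue below $1M. See the Community License Agreement for details, or visit https://stability.ai/license for more information.
Enterprise License: individuals or organizations with annual revenue above $1M should contact us to obtain an Enterprise License.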

Model sources

ComfyUI: GitHub, example workflow
Hugging Face Space: Space
Diffusers: see the usage examples above
GitHub: GitHub
API endpoint:

File structure

See the Files and versions tab of the model repository.

│
├── text_encoders/  
│   ├── README.md
│   ├── clip_g.safetensors
│   ├── clip_l.safetensors
│   ├── t5xxl_fp16.safetensors
│   └── t5xxl_fp8_e4m3fn.safetensors
│
├── README.md
├── LICENSE
├── sd3_large.safetensors
├── SD3.5L_example_workflow.json
└── sd3_large_demo.png

** The file structure below is for diffusers integration **
├── scheduler/
├── text_encoder/
├── text_encoder_2/
├── text_encoder_3/
├── tokenizer/
├── tokenizer_2/
├── tokenizer_3/
├── transformer/
├── vae/
└── model_index.json
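The diffusers layout above can be checked mechanically before trying to load a local copy. The sketch below compares a directory against the folder and file names listed in the tree. `missing_entries` is a hypothetical helper, and the demo runs on a throwaway temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

# Names copied from the diffusers-integration tree above.
EXPECTED_DIRS = (
    "scheduler", "text_encoder", "text_encoder_2", "text_encoder_3",
    "tokenizer", "tokenizer_2", "tokenizer_3", "transformer", "vae",
)
EXPECTED_FILES = ("model_index.json",)

def missing_entries(root):
    """List the expected sub-folders (with a trailing '/') and files absent from root."""
    root = Path(root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing

# Demo: a directory that contains only two of the expected folders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "transformer").mkdir()
    (Path(tmp) / "vae").mkdir()
    demo_missing = missing_entries(tmp)
```

An empty list means every entry in the tree is present. It says nothing about whether the files inside each folder are complete.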

Fine-tuning

See the fine-tuning guide.

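Intended uses

Generating artwork and using it in design and other artistic processes.
Applications in educational or creative tools.
Research on generative models, including understanding their limitations.

All uses of the model must comply with our Acceptable Use Policy.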

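Out-of-scope uses

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is beyond its abilities.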

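Safety

We have taken a range of measures to make the model safe and reliable, implementing safety measures at every stage of model development to reduce potential risks. Even so, we recommend that developers run their own tests for their specific use cases and apply additional mitigations.

Integrity evaluation

Our integrity evaluation methods include structured evaluations and red-teaming for specific harms. Testing was conducted primarily in English and may not cover all possible harms.

Risks identified and mitigations

Harmful content: we trained the model on filtered datasets and implemented safeguards that attempt to balance usefulness against preventing harm. This does not guarantee that all possible harmful content has been removed. All developers and deployers should exercise caution and implement content safety guardrails based on their specific product policies and application use cases.
Misuse: technical limitations, together with developer and end-user education, can help mitigate malicious applications of the model. All users must adhere to our Acceptable Use Policy, including when applying fine-tuning and prompt engineering mechanisms. Refer to the Stability AI Acceptable Use Policy for information on violative uses of our products.
Privacy violations: developers and deployers are encouraged to adopt privacy-preserving techniques and to comply with privacy regulations.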

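Contact

Please report any issues with the model or contact us:

Safety issues: safety@stability.ai
Security issues: security@stability.ai
Privacy issues: privacy@stability.ai
License and general questions: https://stability.ai/license
Enterprise license: https://stability.ai/enterprise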

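🔧 Technical details

Implementation details

QK normalization: implements QK normalization to improve training stability.
Text encoders:
- CLIPs: OpenCLIP-ViT/G and CLIP-ViT/L, context length 77 tokens
- T5: T5-xxl, context length 77/256 tokens at different stages of training
Training data and strategy: the model was trained on a variety of data, including synthetic data and filtered publicly available data.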
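The context lengths listed above imply that a long prompt is seen in full only by the encoder with the larger window. The sketch below reports how many tokens each encoder would drop for a given token count. It assumes a 256-token T5 limit (the larger of the two training-stage values) and takes the token count as an input, because real counts depend on each encoder's own tokenizer and special tokens. `truncation_report` is a hypothetical helper.

```python
# Context limits from the list above; the T5 value is the larger training-stage limit.
CONTEXT_LIMITS = {"OpenCLIP-ViT/G": 77, "CLIP-ViT/L": 77, "T5-xxl": 256}

def truncation_report(num_tokens, limits=CONTEXT_LIMITS):
    """Return, per encoder, how many tokens fall outside its context window."""
    return {name: max(0, num_tokens - limit) for name, limit in limits.items()}

# A 120-token prompt: both CLIP encoders lose the tail, T5 keeps everything.
report_120 = truncation_report(120)

# The same prompt against a longer T5 window, such as the max_sequence_length=512
# passed in the advanced usage example.
report_long_t5 = truncation_report(300, {**CONTEXT_LIMITS, "T5-xxl": 512})
```

This suggests why long, detailed prompts such as the one in the advanced example rely mainly on the T5 encoder for everything past the first 77 tokens.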