Stable Diffusion 3.5 Large: Open-Source Image Generation Model

Stable Diffusion 3.5 Large

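Developed by stabilityai

A text-to-image model built on the Multimodal Diffusion Transformer (MMDiT) architecture, with marked improvements in image quality, typography, and complex prompt understanding

Text-to-Image | English | License: Other | #Multimodal Diffusion Transformer #High-precision text-to-image #Complex typography support

Downloads: 143.20k

Released: 10/22/2024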

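Model Overview

Generates high-quality images from text prompts; suited to creative design, educational tool development, and similar scenarios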

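Model Features

Multimodal Diffusion Transformer architecture

Uses the MMDiT architecture with three fixed, pretrained text encoders to improve image generation quality

QK normalization

Improves training stability and model performance

Multiple text encoders

Supports CLIP-family and T5 text encoders for stronger text understanding

Resource efficiency

Offers a quantized deployment option that reduces GPU memory usage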

Model Capabilities

Text-to-image generation

Complex prompt understanding

High-quality image generation

Improved typography

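Use Cases

Creative design

Artistic creation

Generate artwork from text descriptions

High-quality artistic images

Design assistance

Provide designers with creative inspiration

Diverse design concepts

Educational tools

Educational content generation

Generate image content for educational tools

Rich educational material

Research

Generative model research

Research on text-to-image generative models

Advanced model architectures and techniques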

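🚀 Stable Diffusion 3.5 Large

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that generates high-quality images from text prompts, with strong image quality, typography, complex prompt understanding, and resource efficiency.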

🚀 Quick Start

Install dependencies

Upgrade to the latest version of the 🧨 diffusers library

pip install -U diffusers

Run the example code

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline weights in bfloat16 to reduce memory use relative to float32.
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")  # requires a CUDA-capable GPU

# Generate one image from a text prompt and save it to disk.
image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")

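✨ Key Features

Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model with improved performance in image quality, typography, complex prompt understanding, and resource efficiency.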

📦 Installation

Install the diffusers library

pip install -U diffusers

Install the bitsandbytes library (used for model quantization)

pip install bitsandbytes
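
After installing, it can help to confirm which versions are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the packages named in this guide plus `torch`, which the examples import.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["diffusers", "bitsandbytes", "torch"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry indicates that the corresponding `pip install` step still needs to be run.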

💻 Usage Examples

Basic usage

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")
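
For repeatable results, a seeded `torch.Generator` can be passed to the pipeline call through its `generator` argument. The sketch below wraps this in a small helper; the helper name `build_call_kwargs` and the seed value are hypothetical, and identical output across different hardware or library versions is not guaranteed.

```python
def build_call_kwargs(prompt, seed=None, steps=28, guidance=3.5, device="cuda"):
    """Assemble keyword arguments for a StableDiffusion3Pipeline call.

    When `seed` is given, a seeded torch.Generator is attached so that repeated
    runs with the same settings start from the same initial noise.
    """
    kwargs = {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance,
    }
    if seed is not None:
        import torch  # imported lazily; only needed when a seed is requested

        kwargs["generator"] = torch.Generator(device=device).manual_seed(seed)
    return kwargs


# Usage, assuming `pipe` was created as in the basic example:
# image = pipe(**build_call_kwargs("A capybara holding a sign", seed=42)).images[0]
```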

Advanced usage

Model quantization

import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large"

# Store the transformer weights in 4-bit NF4 and run computations in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Quantize only the transformer, which is loaded from its own subfolder.
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)

# Build the pipeline around the quantized transformer.
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
# Keep each component on the CPU until it is needed on the GPU.
pipeline.enable_model_cpu_offload()

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")
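
To get a rough sense of why 4-bit quantization lowers memory use, the weight storage can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch only: the parameter count of about 8.1 billion is an assumption not stated in this document, and the estimate ignores quantization constants, activations, the text encoders, and the VAE.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 8.1e9 transformer parameters (not stated in this document).
NUM_PARAMS = 8.1e9

bf16 = weight_memory_gib(NUM_PARAMS, 16)  # 2 bytes per parameter
nf4 = weight_memory_gib(NUM_PARAMS, 4)    # about half a byte per parameter
print(f"bf16: {bf16:.1f} GiB, nf4: {nf4:.1f} GiB, ratio: {bf16 / nf4:.0f}x")
```

Actual peak memory depends on resolution, offloading, and which text encoders are loaded, so these figures are a lower bound for the transformer weights rather than a measurement.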

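Fine-tuning

Please refer to the fine-tuning guide.

📚 Documentation

Model Description

Developed by: Stability AI
Model type: MMDiT text-to-image generative model
Model description: The model generates images from text prompts. It is a Multimodal Diffusion Transformer that uses three fixed, pretrained text encoders and applies QK normalization to improve training stability.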
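
The idea behind QK normalization can be illustrated with a small, dependency-free sketch: queries and keys are normalized before their dot product, so the attention logit no longer grows with the magnitude of the activations. RMS normalization without a learnable gain is used here as an assumption for illustration; this is not the model's actual implementation.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (learnable gain omitted)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one query/key pair within a single head."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    dot = sum(a * b for a, b in zip(q, k))
    return dot / math.sqrt(len(q))


# Large activations inflate the raw logit, while the normalized one stays bounded.
q = [100.0, -250.0, 40.0, 300.0]
k = [90.0, -260.0, 55.0, 310.0]
print(attention_logit(q, k, qk_norm=False))  # grows with the magnitude of q and k
print(attention_logit(q, k, qk_norm=True))   # magnitude is at most sqrt(head_dim)
```

Because each normalized vector has a norm of at most sqrt(head_dim), the scaled logit is bounded by sqrt(head_dim) in magnitude, which keeps the softmax inputs in a predictable range during training.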

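License

Community License: Free for research and non-commercial use, and for commercial use by organizations or individuals with less than $1M in annual revenue. See the Community License Agreement for details. Visit Stability AI to learn more, or contact us for commercial licensing details.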

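Model Sources

ComfyUI: GitHub, example workflow
Huggingface Space: Space
Diffusers: see below
GitHub: GitHub
API endpoint:

Model Performance

See the blog for our study comparing performance on prompt adherence and aesthetic quality.

File Structure

Click here to access the Files and versions tab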

│
├── text_encoders/  
│   ├── README.md
│   ├── clip_g.safetensors
│   ├── clip_l.safetensors
│   ├── t5xxl_fp16.safetensors
│   └── t5xxl_fp8_e4m3fn.safetensors
│
├── README.md
├── LICENSE
├── sd3_large.safetensors
├── SD3.5L_example_workflow.json
└── sd3_large_demo.png

** The file structure below is for diffusers integration **
├── scheduler/
├── text_encoder/
├── text_encoder_2/
├── text_encoder_3/
├── tokenizer/
├── tokenizer_2/
├── tokenizer_3/
├── transformer/
├── vae/
└── model_index.json
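
In the diffusers layout, `model_index.json` records which library and class load each component subfolder. The sketch below shows how such an index could be read; the helper name `list_components` is hypothetical, and the JSON excerpt is illustrative, with class names that are assumptions rather than a copy of the repository file.

```python
import json


def list_components(model_index):
    """Map each component name to its (library, class) pair, skipping metadata keys."""
    return {
        name: tuple(value)
        for name, value in model_index.items()
        if not name.startswith("_") and isinstance(value, list)
    }


# Illustrative excerpt; the class names are assumptions, not copied from the repo.
example = json.loads("""
{
  "_class_name": "StableDiffusion3Pipeline",
  "text_encoder": ["transformers", "CLIPTextModelWithProjection"],
  "text_encoder_3": ["transformers", "T5EncoderModel"],
  "transformer": ["diffusers", "SD3Transformer2DModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
""")

for name, (library, cls) in list_components(example).items():
    print(f"{name}/ -> {library}.{cls}")
```

Keys beginning with an underscore hold pipeline metadata rather than components, which is why the helper filters them out.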

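Uses

Intended Uses

Intended uses include the following:

Generating artwork for use in design and other artistic processes.
Use in educational or creative tools.
Research on generative models, including understanding their limitations.

All uses of the model must comply with our Acceptable Use Policy.

Out-of-Scope Uses

The model is not intended to produce factual or true representations of people or events, so using it to generate such content is beyond its capabilities.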

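Safety

As part of our safety-by-design and responsible AI deployment approach, we take deliberate measures to ensure model integrity from the early stages of development. We implement safety measures throughout model development. We have implemented safety mitigations intended to reduce the risk of certain harms, but we recommend that developers conduct their own testing and apply additional mitigations based on their specific use cases. For more about our approach to safety, please visit our Safety page.

Integrity Evaluation

Our integrity evaluation methods include structured evaluations and red-teaming for certain harms. Testing was conducted primarily in English and may not cover all possible harms.

Risks Identified and Mitigations

Harmful content: We used filtered datasets when training the model and implemented safeguards that attempt to strike the right balance between usefulness and preventing harm. However, this does not guarantee that all possible harmful content has been removed. All developers and deployers should exercise caution and implement content safety guardrails based on their specific product policies and application use cases.
Misuse: Technical limitations and developer and end-user education can help mitigate malicious applications of the model. All users are required to adhere to our Acceptable Use Policy, including when applying fine-tuning and prompt engineering mechanisms. Please refer to the Stability AI Acceptable Use Policy for information on violative uses of our products.
Privacy violations: Developers and deployers are encouraged to adopt techniques that respect data privacy and to comply with privacy regulations.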

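Contact

Please report any issues with the model or contact us:

Safety issues: safety@stability.ai
Security issues: security@stability.ai
Privacy issues: privacy@stability.ai
License and general questions: https://stability.ai/license
Enterprise license: https://stability.ai/enterprise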

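🔧 Technical Details

Implementation Details

QK normalization: Applies QK normalization to improve training stability.
Text encoders:
- CLIPs: OpenCLIP-ViT/G, CLIP-ViT/L, context length 77 tokens
- T5: T5-xxl, context length 77/256 tokens at different stages of training
Training data and strategy: The model was trained on a wide variety of data, including synthetic data and filtered publicly available data.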
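
The differing context lengths mean that, for a long prompt, the CLIP encoders and the T5 encoder may see different amounts of text. The following sketch illustrates that effect with plain list truncation. It rests on several assumptions: the helper name `encoder_views` is hypothetical, placeholder strings stand in for real tokens, each encoder's tokenizer would in practice split text differently, and special tokens that count toward the limit are ignored.

```python
def encoder_views(tokens, clip_len=77, t5_len=256):
    """Return the token prefixes each encoder family would keep after truncation."""
    return {"clip": tokens[:clip_len], "t5": tokens[:t5_len]}


# Crude stand-in for tokenization: 120 placeholder tokens.
tokens = [f"tok{i}" for i in range(120)]
views = encoder_views(tokens)
print(len(views["clip"]), len(views["t5"]))  # 77 120
```

In this toy case the CLIP view drops the last 43 tokens while the T5 view keeps the whole prompt, which is one reason long, detailed prompts lean more heavily on the T5 encoder.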