SoteDiffusion V2: open-source text-to-image model for free, high-quality anime-style illustration generation


Sotediffusion V2

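Developed by Disty0

An anime-style text-to-image model built on the Würstchen V3 / Stable Cascade architecture, focused on generating high-quality anime illustrations.

Image generation · English · License: other · #anime-style text-to-image #high-resolution generation #Würstchen architecture optimization

Downloads: 161

Released: 8/7/2024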

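Model Overview

A fine-tuned text-to-image model focused on generating highly aesthetic anime-style images. It was trained on 12 million text-image pairs and supports high-resolution output and fine-grained style control.

Model Features

High-resolution output

Generates images at up to 2048x2048, suitable for commercial-grade illustration work

Anime style optimization

Tuned specifically for anime styles and able to produce highly aesthetic character art

Two-stage sampling

Uses a two-stage sampling strategy, 28 steps for the prior (Stage C) and 14 steps for the decoder (Stage B), to balance generation speed and quality

Fine-grained tag control

Supports the WD tag system for precise control over content attributes such as era and aesthetic score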

Model Capabilities

Anime-style image generation

High-resolution image generation

Style control

Detail refinement

Negative prompt control

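Use Cases

Digital art creation

Anime character design

Quickly generate anime character concept art in a consistent style

Can produce detailed character portraits, including clothing and facial expressions

Illustration

Help artists with sketches and detail refinement for commercial illustrations

Can output high-resolution images suitable for print

Content production

Social media content

Batch-generate social media images in a consistent style

Quickly produce visual content that meets platform requirements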

🚀 SoteDiffusion V2

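SoteDiffusion V2 is an anime fine-tune of Würstchen V3 / Stable Cascade for text-to-image generation of anime-style images.

✨ Key Features

This release is sponsored by fal.ai/grants.
Trained for 1 epoch on 12 million text-image pairs (with WD tags and natural-language captions) using 8x NVIDIA H100 80GB SXM5 GPUs.
Trained in full FP32 with MAE loss.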


📦 Installation

Diffusers

pip install git+https://github.com/huggingface/diffusers

💻 Usage Examples

Basic usage

import torch
import diffusers

device = "cuda"
dtype = torch.float16
model_path = "Disty0/sotediffusion-v2"
pipe = diffusers.AutoPipelineForText2Image.from_pretrained(model_path, torch_dtype=dtype)

# de-dupe
pipe.decoder_pipe.text_encoder = pipe.text_encoder = None # nothing uses this
del pipe.decoder_pipe.text_encoder
del pipe.prior_prior
del pipe.prior_text_encoder
del pipe.prior_tokenizer
del pipe.prior_scheduler
del pipe.prior_feature_extractor
del pipe.prior_image_encoder

pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)


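def encode_prompt(prior_pipe, device, num_images_per_prompt, prompt=""):
    """Encode a prompt of any length with the prior pipeline's text encoder.

    Long prompts are split into 75-token chunks. The empty prompt is encoded
    as a single 77-token chunk without an attention mask.
    """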

    if prompt == "":
        text_inputs = prior_pipe.tokenizer(
            prompt,
            padding="max_length",
            max_length=77,
            truncation=False,
            return_tensors="pt",
        )
        input_ids = text_inputs.input_ids
        attention_mask=None
    else:   
        text_inputs = prior_pipe.tokenizer(
            prompt,
            padding="longest",
            truncation=False,
            return_tensors="pt",
        )
        chunk = []
        padding = []
        max_len = 75  # tokens per chunk, excluding the start and end tokens
        start_token = text_inputs.input_ids[:,0].unsqueeze(0)
        end_token = text_inputs.input_ids[:,-1].unsqueeze(0)
        raw_input_ids = text_inputs.input_ids[:,1:-1]
        prompt_len = len(raw_input_ids[0])
        last_length = prompt_len % max_len  # tokens left over for the final chunk

        # Full 75-token chunks, each wrapped in start and end tokens
        for i in range(int((prompt_len - last_length) / max_len)):
            chunk.append(torch.cat([start_token, raw_input_ids[:,i*max_len:(i+1)*max_len], end_token], dim=1))
        # Pad the final chunk up to 75 tokens with the end token
        for i in range(max_len - last_length):
            padding.append(text_inputs.input_ids[:,-1])

        last_chunk = torch.cat([raw_input_ids[:,prompt_len-last_length:], torch.tensor([padding])], dim=1)
        chunk.append(torch.cat([start_token, last_chunk, end_token], dim=1))
        input_ids = torch.cat(chunk, dim=0)
        attention_mask = torch.ones(input_ids.shape, device=device, dtype=torch.int64)
        attention_mask[-1,last_length+1:] = 0  # mask the padding in the final chunk

    text_encoder_output = prior_pipe.text_encoder(
        input_ids.to(device), attention_mask=attention_mask, output_hidden_states=True
    )

    prompt_embeds = text_encoder_output.hidden_states[-1].reshape(1,-1,1280)
    prompt_embeds = prompt_embeds.to(dtype=prior_pipe.text_encoder.dtype, device=device)
    prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)

    prompt_embeds_pooled = text_encoder_output.text_embeds[0].unsqueeze(0).unsqueeze(1)
    prompt_embeds_pooled = prompt_embeds_pooled.to(dtype=prior_pipe.text_encoder.dtype, device=device)
    prompt_embeds_pooled = prompt_embeds_pooled.repeat_interleave(num_images_per_prompt, dim=0)

    return prompt_embeds, prompt_embeds_pooled


prompt = "1girl, solo, looking at viewer, open mouth, blue eyes, medium breasts, blonde hair, gloves, dress, bow, hair between eyes, bare shoulders, upper body, hair bow, indoors, elbow gloves, hand on own chest, bridal gauntlets, candlestand, smile, rim lighting, from side, castle interior, looking side,"
quality_prompt = "very aesthetic, best quality, newest"
negative_prompt = "very displeasing, displeasing, worst quality, bad quality, low quality, realistic, monochrome, comic, sketch, oldest, early, artist name, signature, blurry, simple background, upside down,"
num_images_per_prompt=1

# Encode the prompt and the quality prompt separately. This supports long prompts and skips the attention mask for empty prompts.
# encode_prompt arguments: prior_pipe, device, num_images_per_prompt, prompt
empty_prompt_embeds, _ = encode_prompt(pipe.prior_pipe, device, num_images_per_prompt, prompt="")

prompt_embeds, prompt_embeds_pooled = encode_prompt(pipe.prior_pipe, device, num_images_per_prompt, prompt=prompt)
quality_prompt_embeds, _ = encode_prompt(pipe.prior_pipe, device, num_images_per_prompt, prompt=quality_prompt)
prompt_embeds = torch.cat([prompt_embeds, quality_prompt_embeds], dim=1)

negative_prompt_embeds, negative_prompt_embeds_pooled = encode_prompt(pipe.prior_pipe, device, num_images_per_prompt, prompt=negative_prompt)

# Pad the shorter of the two embeddings with empty-prompt embeddings until both have the same sequence length
while prompt_embeds.shape[1] < negative_prompt_embeds.shape[1]:
    prompt_embeds = torch.cat([prompt_embeds, empty_prompt_embeds], dim=1)

while negative_prompt_embeds.shape[1] < prompt_embeds.shape[1]:
    negative_prompt_embeds = torch.cat([negative_prompt_embeds, empty_prompt_embeds], dim=1)

output = pipe(
    width=1024,
    height=1536,
    decoder_guidance_scale=1.0,
    prior_guidance_scale=5.0,
    prior_num_inference_steps=28,
    num_inference_steps=14,
    output_type="pil",
    prompt=prompt + " " + quality_prompt,
    negative_prompt=negative_prompt,
    prompt_embeds=prompt_embeds,
    prompt_embeds_pooled=prompt_embeds_pooled,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_prompt_embeds_pooled=negative_prompt_embeds_pooled,
    num_images_per_prompt=num_images_per_prompt,
).images[0]

display(output)  # available in Jupyter/IPython; in a plain script use output.save("output.png") instead
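The long-prompt handling in encode_prompt is easier to follow without tensors. The sketch below mirrors its chunking arithmetic on plain Python lists: the token ids are split into 75-token chunks, each chunk is wrapped in start and end tokens, and the final chunk is padded with the end token and masked. The function name chunk_token_ids and the ids 1 and 2 are placeholders chosen for illustration, not values from the model's tokenizer.

```python
def chunk_token_ids(token_ids, bos_id, eos_id, max_len=75):
    """Split raw token ids (without BOS/EOS) into chunks of max_len + 2 ids.

    Returns (chunks, masks), where masks[i][j] is 1 for attended positions.
    """
    last_length = len(token_ids) % max_len
    full_chunks = len(token_ids) // max_len

    chunks, masks = [], []
    for i in range(full_chunks):
        body = token_ids[i * max_len:(i + 1) * max_len]
        chunks.append([bos_id] + body + [eos_id])
        masks.append([1] * (max_len + 2))

    # The final chunk holds the leftover tokens, padded with EOS ids.
    tail = token_ids[len(token_ids) - last_length:]
    padding = [eos_id] * (max_len - last_length)
    chunks.append([bos_id] + tail + padding + [eos_id])
    # Attend to BOS and the real tokens only; padding and the closing EOS are masked.
    masks.append([1] * (last_length + 1) + [0] * (max_len - last_length + 1))
    return chunks, masks


# Hypothetical 160-token prompt with placeholder BOS/EOS ids.
chunks, masks = chunk_token_ids(list(range(100, 260)), bos_id=1, eos_id=2)
print(len(chunks), [len(c) for c in chunks], sum(masks[-1]))
```

As in encode_prompt, a prompt whose length is an exact multiple of 75 still gets a final chunk that contains only padding.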

📚 Documentation

ComfyUI usage

Start ComfyUI with these flags: --fp16-vae --fp16-unet

Download Stage C to the unet folder: sotediffusion-v2-stage_c.safetensors
Download the Stage C text encoder to the clip folder: sotediffusion-v2-stage_c_text_encoder.safetensors
Download Stage B to the unet folder: sotediffusion-v2-stage_b.safetensors
Download Stage A to the vae folder: stage_a_ft_hq.safetensors

Download the workflow and load it: comfyui_workflow.json
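A quick way to confirm the files landed in the right folders is a small check script. This is a minimal sketch: the file and folder names come from the steps above, while the "models" parent directory, the "ComfyUI" root path and the helper name missing_files are assumptions to adjust for your installation.

```python
from pathlib import Path

# File name -> target folder, as listed in the download steps.
DOWNLOADS = {
    "sotediffusion-v2-stage_c.safetensors": "unet",
    "sotediffusion-v2-stage_c_text_encoder.safetensors": "clip",
    "sotediffusion-v2-stage_b.safetensors": "unet",
    "stage_a_ft_hq.safetensors": "vae",
}


def missing_files(comfy_root):
    """Return the expected model paths that do not exist yet."""
    # Assumption: model folders live under <comfy_root>/models.
    models_dir = Path(comfy_root) / "models"
    expected = [models_dir / folder / name for name, folder in DOWNLOADS.items()]
    return [path for path in expected if not path.is_file()]


for path in missing_files("ComfyUI"):  # hypothetical install location
    print("missing:", path)
```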

Stage C sampler: DPMPP 2M or DPMPP 2M SDE with the SGM Uniform scheduler
Stage C steps: 28
Stage C CFG (classifier-free guidance): 6.0

Stage B sampler: LCM with the Exponential scheduler
Stage B steps: 14
Stage B CFG (classifier-free guidance): 1.0

SD.Next usage

URL: https://github.com/vladmandic/automatic/

Go to Models -> Huggingface, enter Disty0/sotediffusion-v2 in the model name field, and click download.
Once the download finishes, load Disty0/sotediffusion-v2.

Prompt:

your prompt goes here
very aesthetic, best quality, newest,

(In SD.Next, a new line works the same as BREAK.)

Negative prompt:

very displeasing, displeasing, worst quality, bad quality, low quality, realistic, monochrome, comic, sketch, oldest, early, artist name, signature, blurry, simple background, upside down,

Parameters:
Sampler: Default

Steps: 28
Refiner steps: 14

CFG: 5.0 to 6.0
Secondary CFG: 1.0 to 1.5

Resolution: 1280x1280, 1024x1536, 1024x2048, 2048x1152
Any resolution works as long as the width and height are multiples of 128.
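The multiple-of-128 rule is simple to check or enforce in code. The helpers below are an illustrative sketch (the names is_valid_resolution and snap_resolution are not part of any tool mentioned here): one validates a resolution, the other rounds an arbitrary size to the nearest multiple of 128, rounding ties up.

```python
def is_valid_resolution(width, height, multiple=128):
    """True when both dimensions are positive multiples of `multiple`."""
    return width > 0 and height > 0 and width % multiple == 0 and height % multiple == 0


def snap_resolution(width, height, multiple=128):
    """Round each dimension to the nearest multiple (ties round up, minimum one multiple)."""
    def snap(value):
        return max(multiple, (value + multiple // 2) // multiple * multiple)
    return snap(width), snap(height)


print(is_valid_resolution(2048, 1152))
print(snap_resolution(1920, 1080))
```

Snapping changes the aspect ratio slightly, so crop or resize afterwards if the exact output size matters.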

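Training Details

Stage C

Base model: Disty0/sotediffusion-wuerstchen3
GPUs used: 7x NVIDIA H100 80GB SXM5

Parameter	Value
Automatic mixed precision (amp)	no
Weight dtype	fp32
Saved weight dtype	fp32
Resolution	1024x1024
Effective batch size	84
U-Net learning rate	2e-6
Text encoder learning rate	1e-7
Optimizer	AdamW 8bit
Images	6 million, with 2 captions per image
Epochs	1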

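Stage B

Base model: Disty0/sotediffusion-wuerstchen3-decoder
GPUs used: 1x NVIDIA H100 80GB SXM5

Parameter	Value
Automatic mixed precision (amp)	no
Weight dtype	fp32
Saved weight dtype	fp32
Resolution	1024x1024
Effective batch size	8
U-Net learning rate	8e-6
Text encoder learning rate	None
Optimizer	AdamW
Images	120,000
Epochs	6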
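The tables give enough information for a rough estimate of how many optimizer steps each stage ran. The sketch below assumes that the image counts are exact rather than rounded, that each Stage C caption counts as one training sample (6 million images times 2 captions), and that a final partial batch is kept. None of this is confirmed by the source, so treat the results as order-of-magnitude figures.

```python
import math


def optimizer_steps(num_samples, effective_batch_size, epochs):
    """Optimizer steps for a run, assuming the last partial batch is kept."""
    return math.ceil(num_samples / effective_batch_size) * epochs


# Stage C: 6M images x 2 captions, effective batch size 84, 1 epoch.
stage_c_steps = optimizer_steps(6_000_000 * 2, 84, 1)
# Stage B: 120K images, effective batch size 8, 6 epochs.
stage_b_steps = optimizer_steps(120_000, 8, 6)

# Per-GPU share of the Stage C batch (micro-batch size times gradient
# accumulation; the source does not say how the two are split).
stage_c_per_gpu = 84 // 7

print(stage_c_steps, stage_b_steps, stage_c_per_gpu)
```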

WD Tags

The model was trained with tags in this order:

aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
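To illustrate the ordering, here is a small sketch that assembles a prompt from tag groups in the training order. The helper name build_prompt and the group names are made up for this example. Note that the usage examples in this document append the quality tags after the main prompt instead, so this shows how training captions were laid out and is not a recommended prompt format.

```python
TAG_ORDER = ("aesthetic", "quality", "date", "custom", "rating", "character", "series", "rest")


def build_prompt(**groups):
    """Join tag groups in the training order, skipping groups that are not given."""
    unknown = set(groups) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tag groups: {sorted(unknown)}")
    parts = []
    for name in TAG_ORDER:
        value = groups.get(name)
        if not value:
            continue
        # Each group may be a single string or a list of tags.
        parts.extend([value] if isinstance(value, str) else value)
    return ", ".join(parts)


prompt = build_prompt(
    rest=["1girl", "solo", "smile"],
    date="newest",
    quality="best quality",
    aesthetic="very aesthetic",
)
print(prompt)
```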

Date tags

Tag	Date range
newest	2022 to 2024
recent	2019 to 2021
mid	2015 to 2018
early	2011 to 2014
oldest	2005 to 2010
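The table maps directly to a lookup. The sketch below (date_tag is an illustrative name) returns the tag for a given year. How images outside 2005 to 2024 were tagged is not stated, so the function returns None for them.

```python
# (first year, last year, tag), copied from the table above.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year):
    """Return the date tag for a year, or None if the year is outside the table."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None


print(date_tag(2023), date_tag(2016), date_tag(2004))
```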

Aesthetic tags

Model used: shadowlilac/aesthetic-shadow-v2

Score threshold	Tag	Count
0.90	extremely aesthetic	125451
0.80	very aesthetic	887382
0.70	aesthetic	1049857
0.50	slightly aesthetic	1643091
0.40	not displeasing	569543
0.30	not aesthetic	445188
0.20	slightly displeasing	341424
0.10	displeasing	237660
rest	very displeasing	328712
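A score-to-tag mapping along these lines can be sketched as follows. It assumes each threshold is an inclusive lower bound (score >= threshold), which the table does not state explicitly. The function name aesthetic_tag is illustrative.

```python
# (threshold, tag) pairs from the table above, highest first.
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]


def aesthetic_tag(score):
    """Map an aesthetic score in [0, 1] to its tag."""
    for threshold, tag in AESTHETIC_THRESHOLDS:
        if score >= threshold:  # assumption: thresholds are inclusive lower bounds
            return tag
    return "very displeasing"  # the "rest" row


print(aesthetic_tag(0.95), "|", aesthetic_tag(0.6), "|", aesthetic_tag(0.05))
```

The same pattern applies to any threshold table of this shape by swapping in its thresholds and fallback tag.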

Quality tags

Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

Score threshold	Tag	Count
0.980	best quality	1270447
0.900	high quality	498244
0.750	great quality	351006
0.500	medium quality	366448
0.250	normal quality	368380
0.125	bad quality	279050
0.025	low quality	538958
rest	worst quality	1955966

Rating tags

Tag	Count
general	1416451
sensitive	3447664
nsfw	427459
explicit nsfw	336925
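As a sanity check on the tables, the counts can be summed and turned into shares. The sketch below uses only the numbers listed above. The rating and quality tables add up to the same total, which suggests they describe the same set of items. Whether each count is an image or a caption is not stated.

```python
RATING_COUNTS = {
    "general": 1416451,
    "sensitive": 3447664,
    "nsfw": 427459,
    "explicit nsfw": 336925,
}

QUALITY_COUNTS = {
    "best quality": 1270447,
    "high quality": 498244,
    "great quality": 351006,
    "medium quality": 366448,
    "normal quality": 368380,
    "bad quality": 279050,
    "low quality": 538958,
    "worst quality": 1955966,
}

rating_total = sum(RATING_COUNTS.values())
quality_total = sum(QUALITY_COUNTS.values())
print("totals match:", rating_total == quality_total, rating_total)

# Share of each rating tag in the tagged set.
for tag, count in RATING_COUNTS.items():
    print(f"{tag}: {count / rating_total:.1%}")
```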

Custom tags

Dataset name	Custom tag
image boards	date,
text	The text says "text",
characters	character, series
pixiv	art by Display_Name,
visual novel cg	full visual novel name (short name), visual novel cg,
anime wallpaper	date, anime wallpaper,