pseudo-flex-base: an open-source photography model fine-tuned from SD 2.1, with multi-aspect-ratio image generation

Pseudo Flex Base

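Developed by bghira

A multi-aspect-ratio photography model fine-tuned from Stable Diffusion 2.1, with support for dynamic-resolution image generation

Image generation, License: OpenRAIL, #multi-aspect photography #high-resolution generation #photorealistic style

Downloads: 70

Released: 6/25/2023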

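Model overview

A multi-aspect-ratio photography model fine-tuned from stable-diffusion-2-1. It is specifically optimized for non-standard aspect ratios and addresses the poor results that conventional models produce at wide and tall aspect ratios.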

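Model features

Multi-aspect support

Aspect-ratio bucketing improves generation quality for non-square aspect ratios (such as 16:9 and 4:3).

High-resolution generation

The base resolution is 1024x1024, and higher-resolution generation is supported.

Contrast optimization

Offset noise and SNR gamma are used to address image contrast problems.

Diverse dataset

Combines high-quality data from multiple sources, including Kodachrome slides, Midjourney images, and National Geographic.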

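Model capabilities

Text-to-image generation

High-resolution image generation

Multi-aspect-ratio image generation

Photorealistic image generation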

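Use cases

Photographic art

Portrait photography

Generate high-quality portrait photos at a variety of aspect ratios

Can produce natural-looking portraits at different aspect ratios (1:1, 4:3, 16:9, and so on)

Landscape photography

Generate wide-format natural landscape images

Well suited to landscape photos at wide ratios such as 16:9

Creative design

Advertising material

Generate images that fit a range of advertising layouts

Supports advertising material at different aspect ratios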

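🚀 Pseudo-Flex Base model (1024x1024 base resolution)

This is a photography model fine-tuned from stable-diffusion-2-1. It supports different aspect ratios and addresses problems such as generated images looking cropped and poor results on non-square images.

🚀 Quick start

Use the following code to get started with the model: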

# Use PyTorch 2!
import os

import torch
from diffusers import DiffusionPipeline, DDPMScheduler

# Any model currently on the Hugging Face Hub.
model_id = 'ptx0/pseudo-flex-base'
pipeline = DiffusionPipeline.from_pretrained(model_id)

# Optimize!
pipeline.unet = torch.compile(pipeline.unet)
# Loaded from the model repo but not attached to the pipeline here;
# assign it to pipeline.scheduler if you want to sample with it.
scheduler = DDPMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler"
)

# Remove this line if you get an error.
torch.set_float32_matmul_precision('high')

pipeline.to('cuda')
prompts = {
    "woman": "a woman, hanging out on the beach",
    "man": "a man playing guitar in a park",
    "lion": "Explore the ++majestic beauty++ of untamed ++lion prides++ as they roam the African plains --captivating expressions-- in the wildest national geographic adventure",
    "child": "a child flying a kite on a sunny day",
    "bear": "best quality ((bear)) in the swiss alps cinematic 8k highly detailed sharp focus intricate fur",
    "alien": "an alien exploring the Mars surface",
    "robot": "a robot serving coffee in a cafe",
    "knight": "a knight protecting a castle",
    "menn": "a group of smiling and happy men",
    "bicycle": "a bicycle, on a mountainside, on a sunny day",
    "cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
    "wizard": "a mage wizard, bearded and gray hair, blue  star hat with wand and mystical haze",
    "wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
    "macro": "a dramatic city-scape at sunset or sunrise",
    "micro": "RNA and other molecular machinery of life",
    "gecko": "a leopard gecko stalking a cricket"
}
# image.save() does not create missing directories.
os.makedirs('test', exist_ok=True)
for shortname, prompt in prompts.items():
    # Old prompt: ''
    image = pipeline(prompt=prompt,
        negative_prompt='malformed, disgusting, overexposed, washed-out',
        generator=torch.Generator(device='cuda').manual_seed(1641421826),
        width=1368, height=720, guidance_scale=7.5, guidance_rescale=0.3,
        # num_inference_steps was passed twice (32 and 25), which is a SyntaxError;
        # keeping 25 is an assumption.
        num_inference_steps=25).images[0]
    image.save(f'test/{shortname}_nobetas.png', format="PNG")

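✨ Key features

Fine-tuned from stable-diffusion-2-1, with support for different aspect ratios and photographic-style output.
Addresses the cropped look of generated images and poor results on non-square images.

📦 Installation guide

All pre-processing is done with the scripts in bghira/SimpleTuner on GitHub.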

📚 Detailed documentation

Model details

Model description

stable-diffusion-2-1 fine-tuned to support dynamic aspect ratios. The fine-tuning resolutions are:

Index	Width	Height	Aspect ratio	Image count
0	1024	1024	1:1	90561
1	1536	1024	3:2	8716
2	1365	1024	4:3	6933
3	1468	1024	~3:2	113
4	1778	1024	~5:3	6315
5	1200	1024	~5:4	6376
6	1333	1024	~4:3	2814
7	1281	1024	~5:4	52
8	1504	1024	~3:2	139
9	1479	1024	~3:2	25
10	1384	1024	~4:3	1676
11	1370	1024	~4:3	63
12	1499	1024	~3:2	436
13	1376	1024	~4:3	68

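Other aspect ratios have fewer images. The data handling could have been tidier and more careful, but that was part of the experiment's parameters.

Developed by: pseudoterminal
Model type: Diffusion-based text-to-image generation model
Language: English
License: creativeml-openrail-m
Parent model: https://huggingface.co/ptx0/pseudo-real-beta
Resources for more information: More information needed

Uses

See: https://huggingface.co/stabilityai/stable-diffusion-2-1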

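Training details

Training data

A subset of the LAION HD dataset
- https://huggingface.co/datasets/laion/laion-high-resolution We used only a small portion of it; see Preprocessing.

Preprocessing

All pre-processing is done with the scripts in bghira/SimpleTuner on GitHub.

Speeds, sizes, times

Dataset size: 100k image-caption pairs after filtering.
Hardware: 1 A100 80G GPU
Optimizer: 8bit Adam
Batch size: 150
- Actual batch size: 15
- Gradient accumulation steps: 10
- Effective batch size: 150
Learning rate: constant 4e-8, adjusted over time by reducing the batch size.
Training steps: in progress (ongoing)
Training time: about 4 days so far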
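
The batch-size figures in the list above are related by simple multiplication. A minimal sketch of that arithmetic follows; the helper name `effective_batch_size` is illustrative and is not taken from SimpleTuner.

```python
def effective_batch_size(micro_batch: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return micro_batch * accumulation_steps * num_gpus


# Values from the list above: 15 images per forward/backward pass,
# 10 gradient accumulation steps, 1 GPU.
print(effective_batch_size(15, 10))  # 150
```

Gradient accumulation trades wall-clock time for memory: only 15 images need to fit on the GPU at once, while the optimizer still sees gradients averaged over 150.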

Model card authors

pseudoterminal

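🔧 Technical details

Background

The ptx0/pseudo-real-beta pretrained checkpoint was trained on a diverse dataset, with 4200 steps for the Unet and 15600 steps for the text encoder, at a batch size of 15 with 10 gradient accumulation steps. The dataset includes:

cushman (8,000 Kodachrome slides from 1939 to 1969)
midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
National Geographic (about 3,000 to 4,000 images larger than 1024x768, covering animals, wildlife, landscapes, and history)
A small number of stock images of people smoking or vaping

That model can produce realistic photographic and adventure-style images with strong prompt coherence, but it lacks multi-aspect-ratio capability.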

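Training code

Comprehensive aspect-ratio bucketing support was added to the training loop's dataloader. All images smaller than 1024x1024 are discarded, and all remaining images are resized so that the shorter side is 1024; the length of the other dimension follows from the image's aspect ratio. Every image in a batch has the same resolution: images of different resolutions that share an aspect ratio are all resized to 1024x... or ...x1024. For example, a 1920x1080 image is resized to roughly 1820x1024.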
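
The resizing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the SimpleTuner implementation: the function names are hypothetical, "smaller than 1024x1024" is read as either side being below 1024, and plain rounding is used for the long side (the actual training code may round differently).

```python
from collections import defaultdict

BASE = 1024  # short-edge target taken from the text above


def bucket_resolution(width, height, base=BASE):
    """Return the resized (width, height), or None if the image is discarded."""
    if width < base or height < base:
        return None  # assumed reading of "smaller than 1024x1024"
    scale = base / min(width, height)
    return round(width * scale), round(height * scale)


def group_into_buckets(sizes, base=BASE):
    """Group original image sizes by the resolution they are resized to."""
    buckets = defaultdict(list)
    for size in sizes:
        target = bucket_resolution(*size, base=base)
        if target is not None:
            buckets[target].append(size)
    return dict(buckets)


print(bucket_resolution(1920, 1080))  # (1820, 1024)
```

Because every image in a bucket ends up at the same resolution, a batch drawn from a single bucket can be stacked into one tensor without cropping or padding.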

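Starting checkpoint

The pseudo-flex-base model was obtained by fine-tuning the stabilityai/stable-diffusion-2-1 768 base model with its text encoder frozen, training for 1,000 steps on 148,000 images from LAION HD and using the TEXT field as each image's caption. The batch size was again effectively 150 (batch size 15 with 10 gradient accumulation steps). Training is very slow at such high resolutions: at aspect ratios between 1.5 and 1.7, each iteration takes about 700 seconds on an A100 80G. The whole run took two days.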

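Text encoder swap

At 1,000 steps, the text encoder from ptx0/pseudo-real-beta was experimentally combined with this model's Unet in an attempt to resolve some residual image noise, such as pixelation. This proved effective. Training was then restarted from checkpoint 1000 with the new text encoder.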

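The emergence of wide and portrait aspect ratios

Between steps 1300 and 2950, the validation prompts began to "come together". Some checkpoints regressed, but the regressions were usually resolved within about 100 steps. Despite these dips, the overall trend was one of improvement.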

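Image quality degradation and dataset swap

After 3,000 steps of training on 148,000 images at a batch size of 150, image quality began to degrade. This may be because every image in the dataset had been reused 3 times by then, and, given that some image filters discarded about 50,000 images, each image had effectively been used 9 times at an extremely low learning rate. This led to the following problems:

Images began to show static noise.
Training took too long, with only small improvements per checkpoint.
Overfitting to the prompt vocabulary, with a lack of generalization.

Therefore, at 1300 steps, the decision was made to stop training on the original LAION HD dataset and to train instead on a newly acquired subset of high-resolution Midjourney v5.1 data. This subset contains 17,800 images at a base resolution of 1024x1024, of which about 700 are portrait and 700 are landscape.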
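
The reuse figures quoted above can be checked with a little arithmetic. The sketch below is an illustration only; the function name is hypothetical, and the 98,000 and 50,000 dataset sizes are two assumed readings of the filtering remark (about 50,000 images removed versus about 50,000 images remaining).

```python
def passes_over_dataset(steps: int, effective_batch: int, dataset_size: int) -> float:
    """How many times, on average, each image has been seen."""
    return steps * effective_batch / dataset_size


# 3,000 steps at an effective batch size of 150 is 450,000 samples in total.
print(round(passes_over_dataset(3000, 150, 148_000), 2))  # 3.04
print(round(passes_over_dataset(3000, 150, 98_000), 2))   # 4.59
print(round(passes_over_dataset(3000, 150, 50_000), 2))   # 9.0
```

The full 148,000-image set gives the "3 times" figure, and a usable set of about 50,000 images gives the "9 times" figure.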

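Contrast problems

When checkpoint 3275 was tested, darker images came out hazy and brighter images looked poor. Various CFG rescale and guidance levels were tested; the best dark images occurred at guidance_scale = 9.2 and guidance_rescale = 0.0, but they were still "hazy".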
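
For readers unfamiliar with the two parameters, the following is a minimal sketch of how guidance_scale and guidance_rescale interact, assuming the pipeline applies the standard-deviation-matching rescale used in diffusers' Stable Diffusion pipeline. It works on plain Python lists standing in for one sample's noise prediction; the function name `cfg_with_rescale` is hypothetical.

```python
from statistics import pstdev


def cfg_with_rescale(uncond, text, guidance_scale, guidance_rescale=0.0):
    """Classifier-free guidance, optionally rescaled toward the text prediction's std."""
    # Standard classifier-free guidance: push away from the unconditional prediction.
    cfg = [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]
    if guidance_rescale == 0.0:
        return cfg
    # Rescale so the guided prediction has the same spread as the text-conditioned one,
    # then blend the rescaled and unrescaled results.
    factor = pstdev(text) / pstdev(cfg)
    return [guidance_rescale * (c * factor) + (1.0 - guidance_rescale) * c for c in cfg]


uncond = [0.1, -0.2, 0.3, 0.0]
text = [0.5, -0.4, 0.2, 0.1]
plain = cfg_with_rescale(uncond, text, guidance_scale=9.2, guidance_rescale=0.0)
blended = cfg_with_rescale(uncond, text, guidance_scale=7.5, guidance_rescale=0.3)
```

With guidance_rescale = 0.0 the result is plain classifier-free guidance, which is the setting reported above for the best dark images.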

Second dataset change

A new LAION subset was prepared that contains only unique images, no square images, and a limited set of aspect ratios:

16:9
9:16
2:3
3:2

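This was intended to speed up learning and to prevent overfitting to the captions. This LAION subset contains 17,800 images, evenly distributed across the aspect ratios. The images were then captioned with T5 Flan and BLIP2 to obtain highly accurate results.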

Contrast fix: did offset noise / SNR gamma help?

Offset noise and SNR gamma were applied experimentally to checkpoint 4250:

snr_gamma = 5.0
noise_offset = 0.2
noise_pertubation = 0.1

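Within 25 steps of training, contrast returned and the prompt a solid black square once again produced reasonable results. After 50 steps of offset-noise training, the results had clearly improved, and a solid black square showed the least distortion. The checkpoint at step 75 had problems: the SNR gamma calculation caused numerical instability, so that parameter was disabled while the offset noise parameters were left unchanged.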
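
To make the two techniques concrete, here is a minimal pure-Python sketch of offset noise and of the Min-SNR-gamma loss weight, using the snr_gamma and noise_offset values listed above. It is not the SimpleTuner implementation: the function names are hypothetical, nested lists stand in for tensors, and the noise_pertubation setting is not covered.

```python
import random


def offset_noise(channels, height, width, noise_offset=0.2, rng=None):
    """Unit Gaussian noise plus one random shift shared by each whole channel."""
    if rng is None:
        rng = random.Random()
    noise = []
    for _ in range(channels):
        shift = noise_offset * rng.gauss(0.0, 1.0)  # same value for every pixel in the channel
        noise.append([[rng.gauss(0.0, 1.0) + shift for _ in range(width)]
                      for _ in range(height)])
    return noise


def min_snr_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Min-SNR-gamma loss weight for one timestep (snr must be > 0 for epsilon)."""
    clipped = min(snr, gamma)
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    return clipped / snr


print(min_snr_weight(10.0))  # 0.5
print(min_snr_weight(2.0))   # 1.0
```

The per-channel shift lets the model learn to change the overall brightness of an image, which is why offset noise is associated with better very dark and very bright outputs. The weight caps the contribution of low-noise timesteps, whose SNR is high.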