controlnet-canny-sdxl-1.0: open-source image generation model that uses edge detection to produce precise, high-quality images

Controlnet Canny Sdxl 1.0

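Developed by xinsir

A powerful ControlNet model that generates high-resolution images with visual quality comparable to Midjourney, with precise control through Canny edge detection.

Image generation  Open-source license: Apache-2.0  #High-resolution image generation #Canny edge control #Midjourney-level quality

Downloads: 25.79k

Release date: 5/10/2024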

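Model overview

This model is fine-tuned from Stable Diffusion XL 1.0 for text-to-image generation, and is particularly good at producing detailed, high-quality images guided by Canny edge maps.

Model features

High-quality generation

Trained on more than 10 million curated images, with results described as Midjourney-level

Precise control

Uses Canny edge detection to control composition, which supports generating complex scenes

Multi-style support

Handles both photorealistic and anime styles (requires switching the base model)

Advanced training techniques

Optimizes model performance with techniques such as data augmentation, multiple losses, and multi-resolution training

Model capabilities

Text-to-image generation

Composition control via edge maps

High-resolution image generation

Multi-style image generation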

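Use cases

Art creation

Concept art design

Generates complete concept art from rough sketches

Can generate complex, ornate compositions (such as the Day of the Dead example)

Illustration

Turns simple line art into finished illustrations

Supports art styles such as watercolor and oil painting (such as the Waterhouse-style example)

Commercial design

Product display

Generates product promotion images

Can generate professional-grade food photography (such as the pizza example)

Advertising design

Quickly generates advertising concept images

Suits commercial scenes such as seasonal themes (such as the star-background example)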

🚀 Controlnet-Canny-Sdxl-1.0

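This model is a very powerful ControlNet that can generate high-resolution images comparable to Midjourney. It was trained on a large amount of high-quality data with several useful tricks applied. Canny is one of the most important models in the ControlNet series and applies to many tasks related to drawing and design.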

🚀 Quick start

Use the code below to get started with the model.

from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, AutoencoderKL
from diffusers import DDIMScheduler, EulerAncestralDiscreteScheduler
from PIL import Image
import torch
import numpy as np
import cv2

def HWC3(x):
    assert x.dtype == np.uint8
    if x.ndim == 2:
        x = x[:, :, None]
    assert x.ndim == 3
    H, W, C = x.shape
    assert C == 1 or C == 3 or C == 4
    if C == 3:
        return x
    if C == 1:
        return np.concatenate([x, x, x], axis=2)
    if C == 4:
        color = x[:, :, 0:3].astype(np.float32)
        alpha = x[:, :, 3:4].astype(np.float32) / 255.0
        y = color * alpha + 255.0 * (1.0 - alpha)
        y = y.clip(0, 255).astype(np.uint8)
        return y

controlnet_conditioning_scale = 1.0  
prompt = "your prompt, the longer the better, you can describe it as detail as possible"
negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'

eulera_scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16
)

# When testing with a different base model, change the VAE to match it as well.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

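pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    safety_checker=None,
    torch_dtype=torch.float16,
    scheduler=eulera_scheduler,
)
# Assumption: a CUDA GPU is available. The fp16 weights are meant to run on the GPU.
pipe = pipe.to("cuda")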

# For best results, resize the input so its area is about 1024 * 1024 (or matches one of the training bucket resolutions).

controlnet_img = cv2.imread("your image path")
height, width, _  = controlnet_img.shape
ratio = np.sqrt(1024. * 1024. / (width * height))
new_width, new_height = int(width * ratio), int(height * ratio)
controlnet_img = cv2.resize(controlnet_img, (new_width, new_height))

controlnet_img = cv2.Canny(controlnet_img, 100, 200)
controlnet_img = HWC3(controlnet_img)
controlnet_img = Image.fromarray(controlnet_img)

images = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=controlnet_img,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    width=new_width,
    height=new_height,
    num_inference_steps=30,
    ).images

images[0].save(f"your image save path, png format is usually better than jpg or webp in terms of image quality but got much bigger")
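
The resize step in the script above scales the input so that its area is about 1024 * 1024 while keeping the aspect ratio. The following is a minimal sketch of that step as a reusable helper. The name `fit_to_area` and the rounding of each side down to a multiple of 8 are assumptions added for illustration (the SDXL VAE downsamples by a factor of 8); the original script does not round, and the exact training bucket resolutions are not listed in this document.

```python
import math


def fit_to_area(width, height, target_area=1024 * 1024, multiple=8):
    """Scale (width, height) to about target_area pixels, keeping the aspect ratio.

    Each side is rounded down to a multiple of `multiple` (an assumption made
    here, not part of the original script), with a floor of one multiple.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Same ratio as the quick-start script: sqrt(target area / source area).
    ratio = math.sqrt(target_area / (width * height))
    new_width = max(multiple, int(width * ratio) // multiple * multiple)
    new_height = max(multiple, int(height * ratio) // multiple * multiple)
    return new_width, new_height


if __name__ == "__main__":
    print(fit_to_area(2048, 2048))
    print(fit_to_area(1920, 1080))
```

The returned pair could be passed to `cv2.resize` and then to the pipeline as `width` and `height`, in place of `new_width` and `new_height`.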

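✨ Key features

Trained on a large amount of high-quality data (more than 10 million images) that was carefully filtered and captioned.
Useful tricks were applied during training, including data augmentation, multiple loss functions, and multi-resolution training.
With only single-stage training, it outperforms other open-source Canny models ([diffusers/controlnet-canny-sdxl-1.0], [TheMistoAI/MistoLine]).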

📚 Documentation

Model details

Model description

Developed by: xinsir
Model type: ControlNet_SDXL
License: apache-2.0
Finetuned from model [optional]: stabilityai/stable-diffusion-xl-base-1.0

Model sources [optional]

Paper [optional]: https://arxiv.org/abs/2302.05543

Usage

Examples

Prompt: A closeup of two day of the dead models, looking to the side, large flowered headdress, full dia de Los muertoe make up, lush red lips, butterflies, flowers, pastel colors, looking to the side, jungle, birds, color harmony , extremely detailed, intricate, ornate, motion, stunning, beautiful, unique, soft lighting
Prompt: ghost with a plague doctor mask in a venice carnaval hyper realistic
Prompt: A picture surrounded by blue stars and gold stars, glowing, dark navy blue and gray tones, distributed in light silver and gold, playful, festive atmosphere, pure fabric, chalk, FHD 8K
Prompt: Delicious vegetarian pizza with champignon mushrooms, tomatoes, mozzarella, peppers and black olives, isolated on white background , transparent isolated white background , top down view, studio photo, transparent png, Clean sharp focus. High end retouching. Food magazine photography. Award winning photography. Advertising photography. Commercial photography
Prompt: a blonde woman in a wedding dress in a maple forest in summer with a flower crown laurel. Watercolor painting in the style of John William Waterhouse. Romanticism. Ethereal light.

Anime examples (the base model must be changed to CounterfeitXL; everything else stays the same)

Evaluation metrics

Laion Aesthetic Score [https://laion.ai/blog/laion-aesthetics/]
PerceptualSimilarity [https://github.com/richzhang/PerceptualSimilarity]

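Evaluation data

The goal of this project is to let people draw images the way Midjourney does, so the test data consists of prompt-image pairs randomly sampled from Midjourney upscaled images. Midjourney's users include many professional designers, and upscaled images tend to have high aesthetic scores and strong prompt consistency, which makes them a suitable test set for judging a ControlNet's ability. We randomly selected 300 prompt-image pairs and generated 4 images per prompt, for 1,200 images in total. We computed the Laion Aesthetic Score to measure aesthetics and PerceptualSimilarity to measure control ability. We found that image quality agrees well with the metric values. We compared our model with other SOTA models on Hugging Face; the results are shown below. Our model has the highest aesthetic score and can generate visually appealing images when given a suitable prompt.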

Quantitative results

Metric	xinsir/controlnet-canny-sdxl-1.0	diffusers/controlnet-canny-sdxl-1.0	TheMistoAI/MistoLine
laion_aesthetic	6.03	5.93	5.82
perceptual similarity	0.4200	0.5053	0.5387

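laion_aesthetic (higher is better)
perceptual similarity (lower is better)

Note: The values were computed on images saved in webp format. Saving in png format raises the aesthetic score by 0.1 to 0.3, but the relative ranking does not change.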
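
To put the table in perspective, the short sketch below computes the relative difference between this model and each baseline for both metrics. The scores are copied from the table; the dictionary layout and the helper name `relative_change` are illustrative assumptions. This is plain arithmetic on the reported averages, not a statistical significance test.

```python
# Scores copied from the quantitative results table.
OURS = "xinsir/controlnet-canny-sdxl-1.0"
SCORES = {
    "xinsir/controlnet-canny-sdxl-1.0": {"laion_aesthetic": 6.03, "perceptual_similarity": 0.4200},
    "diffusers/controlnet-canny-sdxl-1.0": {"laion_aesthetic": 5.93, "perceptual_similarity": 0.5053},
    "TheMistoAI/MistoLine": {"laion_aesthetic": 5.82, "perceptual_similarity": 0.5387},
}


def relative_change(ours, baseline):
    """Percentage change of `ours` relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0


def compare_to_baselines(scores=SCORES, ours=OURS):
    """Return {baseline: {metric: percent change}} for every other model."""
    result = {}
    for name, metrics in scores.items():
        if name == ours:
            continue
        result[name] = {
            metric: relative_change(scores[ours][metric], value)
            for metric, value in metrics.items()
        }
    return result


if __name__ == "__main__":
    for baseline, changes in compare_to_baselines().items():
        for metric, pct in changes.items():
            print(f"{metric} vs {baseline}: {pct:+.1f}%")
```

On these numbers the aesthetic score is roughly 2% to 4% higher than the baselines, while perceptual similarity (where lower is better) is roughly 17% to 22% lower.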

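Training details

The model was trained on high-quality data in a single stage, at the same 1024*1024 resolution as sdxl-base. Following Lvmin Zhang, we generate Canny images with random thresholds; finding suitable hyperparameters for this data augmentation matters, because a task that is too easy or too hard degrades model performance. We also use random masks to hide a random fraction of each Canny image, which pushes the model to learn more of the semantics linking the prompt and the lines. We used more than 10 million images, all carefully annotated. CogVLM has proven to be a powerful image captioning model [https://github.com/THUDM/CogVLM?tab=readme-ov-file]. For cartoon images, we recommend using waifu tagger to generate special tags [https://huggingface.co/spaces/SmilingWolf/wd-tagger]. The model was trained on more than 64 A100s; with gradient accumulation, the effective batch size is 2560.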
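
The two augmentations described above, random Canny thresholds and random masking of the edge map, might look roughly like the sketch below. This is not the authors' training code: the threshold ranges, the block-shaped mask, the block size, and the maximum masked fraction are all assumptions chosen for illustration, since the document does not state the actual hyperparameters.

```python
import random

import numpy as np


def sample_canny_thresholds(rng, low_range=(50, 150), high_range=(150, 250)):
    """Draw a (low, high) Canny threshold pair with low < high.

    The ranges are illustrative assumptions, not the values used in training.
    """
    low = rng.randint(*low_range)
    high = rng.randint(max(low + 1, high_range[0]), high_range[1])
    return low, high


def mask_random_blocks(edges, rng, max_fraction=0.3, block=64):
    """Zero out a random share (at most max_fraction) of square blocks.

    `edges` is an edge map as a NumPy array (H x W or H x W x C).
    The input is not modified; a masked copy is returned.
    """
    out = edges.copy()
    h, w = out.shape[:2]
    cells = [(y, x) for y in range(0, h, block) for x in range(0, w, block)]
    n_masked = int(len(cells) * rng.uniform(0.0, max_fraction))
    for y, x in rng.sample(cells, n_masked):
        out[y:y + block, x:x + block] = 0
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    low, high = sample_canny_thresholds(rng)
    # Stand-in for an edge map; in practice this would come from
    # cv2.Canny(image, low, high).
    edges = np.full((256, 256), 255, dtype=np.uint8)
    masked = mask_random_blocks(edges, rng)
    print(low, high, float((masked == 0).mean()))
```

Sampling the thresholds per training example varies how dense the edge map is, and masking blocks forces the model to fill in missing structure from the prompt, which matches the motivation given in the text.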

Training data

The data comes from many sources, including Midjourney, laion 5B, and danbooru. It was carefully filtered and annotated.

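Conclusion

In our evaluation, the model achieves a better aesthetic score on real images than stabilityai/stable-diffusion-xl-base-1.0 and comparable performance on cartoon-style images. Thanks to stronger data augmentation and more training steps, it shows better control ability when tested with perceptual similarity. It also produces a lower rate of abnormal images, which tend to include abnormal human anatomy.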