Controlnet-canny-sdxl-1.0 Open-source Image Generation Model - Precisely Generate High-quality Images with Edge Detection

Controlnet Canny Sdxl 1.0

Developed by xinsir

A powerful control network model capable of generating high-resolution images with visual quality comparable to Midjourney, achieving precise control through Canny edge detection.

Image Generation Open Source License: Apache-2.0 #High-resolution image generation #Canny edge control #Midjourney-level quality

Downloads 25.79k

Release Time : 5/10/2024

Model Overview

This model is fine-tuned based on Stable Diffusion XL 1.0, focusing on text-to-image generation tasks, and excels in producing high-quality images with rich details through Canny edge map control.

Model Features

High-quality generation

Trained on over 10 million curated images, achieving Midjourney-level generation quality.

Precise control

Utilizes Canny edge detection for composition control, supporting complex scene generation.

Multi-style adaptation

Supports both photorealistic and anime styles (requires switching base models).

Advanced training techniques

Employs data augmentation, multi-loss functions, and multi-resolution training to optimize model performance.

Model Capabilities

Text-to-image generation

Composition control via edge maps

High-resolution image generation

Multi-style image generation

Use Cases

Artistic creation

Concept art design

Generate complete artistic concept images from sketches.

Capable of producing intricate and elaborate artistic compositions (e.g., Day of the Dead theme in examples).

Illustration creation

Transform simple sketches into complete illustrations.

Supports various artistic styles like watercolor and oil painting (e.g., Waterhouse style in examples).

Commercial design

Product presentation

Generate product promotional images.

Capable of professional-grade food photography (e.g., pizza image in examples).

Advertisement design

Quickly generate advertisement concept images.

Supports commercial scenarios like holiday themes (e.g., starry background in examples).

🚀 Controlnet-Canny-Sdxl-1.0

A powerful ControlNet model that can generate high-resolution images visually comparable with Midjourney, advancing the application of Stable Diffusion models.


🚀 Quick Start

Hello, I am very happy to announce the controlnet-canny-sdxl-1.0 model, a very powerful ControlNet that can generate high-resolution images visually comparable with Midjourney. The model was trained on a large amount of high-quality data (over 10,000,000 images), carefully filtered and captioned using a powerful vision-language model. In addition, several useful tricks were applied during training, including data augmentation, multiple losses, and multi-resolution training. With only one stage of training, it outperforms other open-source Canny models ([diffusers/controlnet-canny-sdxl-1.0], [TheMistoAI/MistoLine]). I am releasing it in the hope of advancing the application of Stable Diffusion models. Canny is one of the most important models in the ControlNet series and can be applied to many tasks related to drawing and design.

✨ Features

Trained with over 10,000,000 high-quality images.
Utilizes data augmentation, multiple losses, and multi-resolution techniques during training.
Outperforms other open-source Canny models with only one stage of training.
Can generate high-resolution images comparable to Midjourney.

📚 Documentation

Model Details

Model Description

Developed by: xinsir
Model type: ControlNet_SDXL
License: apache-2.0
Finetuned from model: stabilityai/stable-diffusion-xl-base-1.0

Model Sources

Paper: https://arxiv.org/abs/2302.05543

Uses

Examples

Prompt 1: A closeup of two day of the dead models, looking to the side, large flowered headdress, full dia de Los muertoe make up, lush red lips, butterflies, flowers, pastel colors, looking to the side, jungle, birds, color harmony, extremely detailed, intricate, ornate, motion, stunning, beautiful, unique, soft lighting
Prompt 2: ghost with a plague doctor mask in a venice carnaval hyper realistic
Prompt 3: A picture surrounded by blue stars and gold stars, glowing, dark navy blue and gray tones, distributed in light silver and gold, playful, festive atmosphere, pure fabric, chalk, FHD 8K
Prompt 4: Delicious vegetarian pizza with champignon mushrooms, tomatoes, mozzarella, peppers and black olives, isolated on white background, transparent isolated white background, top down view, studio photo, transparent png, Clean sharp focus. High-end retouching. Food magazine photography. Award winning photography. Advertising photography. Commercial photography
Prompt 5: a blonde woman in a wedding dress in a maple forest in summer with a flower crown laurel. Watercolor painting in the style of John William Waterhouse. Romanticism. Ethereal light.

Examples Anime (note that you need to change the base model to CounterfeitXL; everything else remains the same)


How to Get Started with the Model

from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, AutoencoderKL
from diffusers import DDIMScheduler, EulerAncestralDiscreteScheduler
from PIL import Image
import torch
import numpy as np
import cv2

def HWC3(x):
    assert x.dtype == np.uint8
    if x.ndim == 2:
        x = x[:, :, None]
    assert x.ndim == 3
    H, W, C = x.shape
    assert C == 1 or C == 3 or C == 4
    if C == 3:
        return x
    if C == 1:
        return np.concatenate([x, x, x], axis=2)
    if C == 4:
        color = x[:, :, 0:3].astype(np.float32)
        alpha = x[:, :, 3:4].astype(np.float32) / 255.0
        y = color * alpha + 255.0 * (1.0 - alpha)
        y = y.clip(0, 255).astype(np.uint8)
        return y

controlnet_conditioning_scale = 1.0
prompt = "your prompt, the longer the better, you can describe it in as much detail as possible"
negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'

eulera_scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16
)

# When testing with another base model, you also need to change the VAE.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    safety_checker=None,
    torch_dtype=torch.float16,
    scheduler=eulera_scheduler,
)
# Assumption: a CUDA GPU is available; float16 inference is impractical on CPU.
pipe.to("cuda")

# For best results, resize the image to about 1024 * 1024 pixels, or to the same bucket resolution.

controlnet_img = cv2.imread("your image path")
height, width, _  = controlnet_img.shape
ratio = np.sqrt(1024. * 1024. / (width * height))
new_width, new_height = int(width * ratio), int(height * ratio)
controlnet_img = cv2.resize(controlnet_img, (new_width, new_height))

controlnet_img = cv2.Canny(controlnet_img, 100, 200)
controlnet_img = HWC3(controlnet_img)
controlnet_img = Image.fromarray(controlnet_img)

images = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=controlnet_img,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    width=new_width,
    height=new_height,
    num_inference_steps=30,
    ).images

# PNG usually preserves image quality better than JPG or WebP, at the cost of much larger files.
images[0].save("your image save path")
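
The resize step in the snippet above scales the input so that its area is about 1024 * 1024 pixels while keeping the aspect ratio. A minimal sketch of that calculation as a standalone helper follows. The function name `bucket_size` and the rounding to the nearest multiple of 8 are assumptions of this sketch, not part of the original snippet, which simply truncates with `int`. Multiples of 8 are chosen here because the SDXL VAE works on latents downsampled by a factor of 8.

```python
import math
from typing import Tuple


def bucket_size(width: int, height: int,
                target_area: int = 1024 * 1024,
                multiple: int = 8) -> Tuple[int, int]:
    """Scale (width, height) to roughly target_area pixels, keeping the aspect ratio.

    Rounding each side to the nearest multiple of 8 is an assumption of this
    sketch; the original snippet only truncates to an integer.
    """
    ratio = math.sqrt(target_area / (width * height))
    new_width = max(multiple, round(width * ratio / multiple) * multiple)
    new_height = max(multiple, round(height * ratio / multiple) * multiple)
    return new_width, new_height


# A square input keeps its size; a 16:9 input becomes a wide bucket.
print(bucket_size(1024, 1024))  # (1024, 1024)
print(bucket_size(1920, 1080))  # (1368, 768)
```

The returned pair can be passed to `cv2.resize` and to the `width` and `height` arguments of the pipeline call in place of `new_width` and `new_height`.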

Evaluation Metric

Laion Aesthetic Score [https://laion.ai/blog/laion-aesthetics/]
PerceptualSimilarity [https://github.com/richzhang/PerceptualSimilarity]

Evaluation Data

The test data is randomly sampled from Midjourney upscaled images with their prompts. The purpose of the project is to let people draw images like Midjourney’s, Midjourney’s users include a large number of professional designers, and upscaled images tend to have a higher aesthetic score and better prompt consistency, so they make a suitable test set for judging the ability of a ControlNet. We randomly select 300 prompt-image pairs and generate 4 images per prompt, for a total of 1,200 generated images. We calculate the Laion Aesthetic Score to measure beauty and PerceptualSimilarity to measure control ability. We find that image quality is consistent with the metric values. We compare our method with other SOTA Hugging Face models and list the results below. Our model has the highest aesthetic score and can generate visually appealing images if you prompt it properly.
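
The protocol described here (300 prompt-image pairs, 4 generations each, two averaged metrics) can be sketched as a small loop. This is an illustration only, not the author's evaluation code: `generate`, `aesthetic_score`, and `perceptual_distance` are hypothetical callables standing in for the pipeline, the Laion aesthetic predictor, and PerceptualSimilarity. Comparing each generated image against its reference image is also an assumption, since the text does not say which pair of images the distance is computed on.

```python
from statistics import mean


def evaluate(pairs, generate, aesthetic_score, perceptual_distance,
             images_per_prompt=4):
    """Average both metrics over every generated image.

    pairs: iterable of (prompt, reference_image) tuples.
    The three callables are hypothetical placeholders supplied by the caller.
    """
    aesthetic, distance = [], []
    for prompt, reference in pairs:
        for seed in range(images_per_prompt):
            image = generate(prompt, reference, seed)
            aesthetic.append(aesthetic_score(image))
            distance.append(perceptual_distance(image, reference))
    return {
        "num_images": len(aesthetic),
        "laion_aesthetic": mean(aesthetic),            # higher is better
        "perceptual_similarity": mean(distance),       # lower is better
    }


# Stub usage: numbers stand in for images so the loop can run without a model.
stub_pairs = [("a prompt", 0.0)] * 300
report = evaluate(
    stub_pairs,
    generate=lambda prompt, reference, seed: float(seed),
    aesthetic_score=lambda image: 6.0,
    perceptual_distance=lambda image, reference: abs(image - reference),
)
print(report["num_images"])  # 1200
```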

Quantitative Result

metric	xinsir/controlnet-canny-sdxl-1.0	diffusers/controlnet-canny-sdxl-1.0	TheMistoAI/MistoLine
laion_aesthetic	6.03	5.93	5.82
perceptual similarity	0.4200	0.5053	0.5387

laion_aesthetic (the higher the better)
perceptual similarity (the lower the better)

Note: The values were calculated on images saved in webp format. If you save in png format, the aesthetic values increase by 0.1 to 0.3, but the relative ranking remains unchanged.

Training Details

The model is trained on high-quality data in a single training stage. The resolution setting is the same as sdxl-base, 1024*1024. Following Lvmin Zhang, we use a random threshold to generate the Canny images. Finding proper hyperparameters for the data augmentation is essential, as making it too easy or too hard hurts model performance. In addition, we apply a random mask that hides a random percentage of each Canny image, forcing the model to learn more of the semantic relationship between the prompt and the lines. We use over 10,000,000 carefully annotated images. CogVLM has proven to be a powerful image captioning model [https://github.com/THUDM/CogVLM?tab=readme-ov-file]. For comic images, it is recommended to use Waifu Tagger to generate special tags [https://huggingface.co/spaces/SmilingWolf/wd-tagger]. More than 64 A100s are used to train the model, and the real batch size is 2560 when using accumulated_grad_batches.
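
The two augmentations mentioned here, random Canny thresholds and random masking of the edge map, might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the function names, the threshold ranges, the 50% cap on the masked area, and the choice of a single rectangular mask. The actual training hyperparameters are not published in this card.

```python
import numpy as np


def sample_canny_thresholds(rng, low_range=(50, 150), high_range=(150, 250)):
    """Draw a (low, high) threshold pair for cv2.Canny; the ranges are hypothetical."""
    low = int(rng.integers(low_range[0], low_range[1]))     # upper bound exclusive
    high = int(rng.integers(high_range[0], high_range[1]))  # upper bound exclusive
    return low, high


def random_mask(edges, rng, max_fraction=0.5):
    """Zero out one random rectangle covering up to max_fraction of the edge map.

    Returns a copy; the input array is left unchanged.
    """
    h, w = edges.shape[:2]
    fraction = rng.uniform(0.0, max_fraction)
    mask_h = int(round(h * np.sqrt(fraction)))
    mask_w = int(round(w * np.sqrt(fraction)))
    top = int(rng.integers(0, h - mask_h + 1))
    left = int(rng.integers(0, w - mask_w + 1))
    masked = edges.copy()
    masked[top:top + mask_h, left:left + mask_w] = 0
    return masked


rng = np.random.default_rng(0)
low, high = sample_canny_thresholds(rng)
edges = np.full((64, 64), 255, dtype=np.uint8)  # stand-in for a Canny edge map
masked = random_mask(edges, rng)
print(low, high, float((masked == 0).mean()))
```

In a real data loader, `low` and `high` would be passed to `cv2.Canny` for each training image, and `random_mask` would be applied to the resulting edge map before it is fed to the ControlNet.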

Training Data

The data comes from many sources, including Midjourney, Laion 5B, Danbooru, and others. The data is carefully filtered and annotated.

Conclusion

In our evaluation, the model achieved a better aesthetic score on real images than stabilityai/stable-diffusion-xl-base-1.0, and comparable performance on cartoon-style images. The model shows better control ability when measured with perceptual similarity, owing to stronger data augmentation and more training steps. In addition, the model is less likely to generate abnormal images, such as those containing abnormal human structures.

📄 License

The model is released under the apache-2.0 license.
