🚀 Stable Diffusion v1 Model Card
Stable Diffusion is a latent text-to-image diffusion model. It can generate photo-realistic images from any text input, offering high-quality image generation capabilities for various applications.
🚀 Quick Start
The Stable-Diffusion-v1-3 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 195,000 steps at resolution `512x512` on "laion-improved-aesthetics", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
Download the weights
These weights are intended for use with the original CompVis Stable Diffusion codebase. If you're looking for the model to use with the 🧨 Diffusers library, see the Diffusers version of these weights.
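For reference, loading the model through Diffusers looks roughly like the sketch below. It assumes the v1-3 weights are published on the Hugging Face Hub as `CompVis/stable-diffusion-v1-3` and that a CUDA GPU is available; adjust both to your setup:

```python
# Minimal sketch: text-to-image with the Diffusers pipeline.
# Assumes the v1-3 weights live at "CompVis/stable-diffusion-v1-3" on the Hub.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3",  # assumed Hub model ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # use "cpu" (and float32) if no GPU is available

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]  # classifier-free guidance is applied by default
image.save("astronaut_rides_horse.png")
```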
✨ Features
- Text-to-Image Generation: Capable of generating photo-realistic images from text prompts.
- Fine-Tuned Checkpoints: Different checkpoints are available, such as `sd-v1-1.ckpt`, `sd-v1-2.ckpt`, and `sd-v1-3.ckpt`, each with specific training procedures.
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | Robin Rombach, Patrick Esser |
| Model Type | Diffusion-based text-to-image generation model |
| Language(s) | English |
| License | The CreativeML OpenRAIL M license, an Open RAIL M license, adapted from the work of BigScience and the RAIL Initiative in responsible AI licensing. See also the article about the BLOOM Open RAIL license on which this license is based. |
| Model Description | A model for generating and modifying images based on text prompts. It is a Latent Diffusion Model using a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper. |
| Resources for more information | GitHub Repository, Paper |
| Cite as | Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR 2022 (`@InProceedings{Rombach_2022_CVPR, ...}`) |
Uses
Direct Use
The model is for research purposes only. Possible research areas and tasks include:
- Safe deployment of models with the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Misuse, Malicious Use, and Out-of-Scope Use
⚠️ Important Note
This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies equally to Stable Diffusion v1.
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating disturbing, distressing, or offensive images, or content that propagates stereotypes.
- Out-of-Scope Use: The model was not trained to provide factual or true representations of people or events. Using it for such purposes is beyond its capabilities.
- Misuse and Malicious Use: Using the model to generate cruel content towards individuals is a misuse. This includes generating demeaning, discriminatory, or otherwise harmful representations, impersonating individuals without consent, generating non-consensual sexual content, spreading mis- and disinformation, representing egregious violence and gore, and sharing copyrighted or licensed material in violation of its terms of use.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism.
- It cannot render legible text.
- It performs poorly on difficult tasks involving compositionality, like rendering an image of “A red cube on top of a blue sphere”.
- Faces and people may not be generated properly.
- Trained mainly with English captions, it works less well in other languages.
- The autoencoding part of the model is lossy.
- Trained on the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset, which contains adult material and is unfit for product use without additional safety measures.
- No deduplication measures were used on the dataset, resulting in some memorization of duplicated training images. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to detect memorized images.
Bias
While image generation models are impressive, they can reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/), mainly with English descriptions. Texts and images from non-English communities are likely under-represented, affecting the model's output, with white and western cultures often being the default. The model also performs significantly worse with non-English prompts.
Training
Training Data
The model was trained using the following dataset:
- LAION-2B (en) and its subsets.
Training Procedure
Stable Diffusion v1 is a latent diffusion model combining an autoencoder with a diffusion model trained in the autoencoder's latent space. During training:
- Images are encoded by an encoder into latent representations. The autoencoder has a relative downsampling factor of f = 8, mapping images of shape H x W x 3 to latents of shape H/f x W/f x 4.
- Text prompts are encoded by a ViT-L/14 text encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise added to the latent and the UNet's prediction.
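The training step implied by this list is straightforward to sketch. The following is illustrative only, written against Diffusers-style components (`vae`, `text_encoder`, `unet`, and `scheduler` are assumed to be a pretrained AutoencoderKL, CLIP text encoder, conditional UNet, and DDPM-style noise scheduler); it is not the original CompVis training code:

```python
# Illustrative sketch of one latent-diffusion training step.
import torch
import torch.nn.functional as F

def training_step(images, input_ids, vae, text_encoder, unet, scheduler):
    # 1. Encode images into the autoencoder's latent space (H/8 x W/8 x 4);
    #    0.18215 is the standard SD latent scaling factor.
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    # 2. Encode the prompt; the non-pooled token embeddings condition the UNet.
    text_emb = text_encoder(input_ids)[0]
    # 3. Add noise to the latents at a random diffusion timestep.
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # 4. The UNet predicts the added noise, conditioned on the text
    #    embeddings via cross-attention.
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    # 5. Reconstruction objective between the added noise and the prediction.
    return F.mse_loss(pred, noise)
```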
We currently offer three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt`, and `sd-v1-3.ckpt`, trained as follows:
- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en), followed by 194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`. 515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size `>= 512x512`, an estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`; the watermark estimate comes from the LAION-5B metadata, and the aesthetics score from an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

Training details:
- Hardware: 32 x 8 x A100 GPUs
- Optimizer: AdamW
- Gradient Accumulations: 2
- Batch: 32 x 8 x 2 x 4 = 2048
- Learning rate: warmed up to 0.0001 over the first 10,000 steps and then kept constant
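The learning-rate schedule in the list above amounts to a short warmup followed by a constant rate. A minimal sketch, assuming the warmup is linear (the card only says "warmed up"):

```python
# Hypothetical helper reproducing the schedule described above: warmup to
# 1e-4 over 10,000 steps, then constant. The linear shape is an assumption.
def lr_at(step: int, base_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    return base_lr * min(1.0, step / warmup_steps)
```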
Evaluation Results
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints:

Evaluated using 50 PLMS steps and 10,000 random prompts from the COCO2017 validation set at 512x512 resolution. Not optimized for FID scores.
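For context on those guidance scales: classifier-free guidance runs the UNet twice per denoising step, once with the text embeddings and once with an unconditional (empty-prompt) embedding, and extrapolates between the two noise predictions. The 10% text-conditioning dropout during training is what makes the unconditional prediction meaningful. A minimal sketch, with hypothetical function and argument names, using the Diffusers-style UNet call from the earlier training sketch:

```python
import torch

# Illustrative sketch of classifier-free guidance at sampling time.
@torch.no_grad()
def guided_noise_pred(unet, noisy_latents, t, text_emb, uncond_emb, guidance_scale):
    eps_cond = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    eps_uncond = unet(noisy_latents, t, encoder_hidden_states=uncond_emb).sample
    # guidance_scale = 1.0 recovers the purely conditional prediction; larger
    # values (e.g. the 1.5-8.0 range above) trade diversity for prompt fidelity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```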
Environmental Impact
Stable Diffusion v1 Estimated Emissions
Based on the provided information, we estimate the following CO2 emissions using the Machine Learning Impact calculator from Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
- Hardware Type: A100 PCIe 40GB
- Hours used: 150000
- Cloud Provider: AWS
- Compute Region: US-east
- Carbon Emitted (power consumption x time x carbon produced based on location of power grid): 11,250 kg CO2 eq.
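As a sanity check, the 11,250 kg figure is reproducible with a back-of-envelope calculation. The per-GPU power draw and grid carbon intensity below are assumptions chosen to match the calculator's output, not published numbers:

```python
# Back-of-envelope reproduction of the estimate above. The 250 W average
# draw per A100 PCIe 40GB and 0.3 kg CO2eq/kWh for US-east are assumptions.
gpu_hours = 150_000
power_kw = 0.25                             # assumed average draw per GPU
grid_intensity = 0.3                        # assumed kg CO2eq per kWh
energy_kwh = gpu_hours * power_kw           # 37,500 kWh
emissions_kg = energy_kwh * grid_intensity  # ≈ 11,250 kg CO2 eq.
print(f"{emissions_kg:,.0f} kg CO2 eq.")
```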
📄 License
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- The authors claim no rights on the outputs you generate. You are free to use them but accountable for their use, which must not violate the provisions of the license.
- You may redistribute the weights and use the model commercially and/or as a service. If you do, you must include the same use restrictions as in the license and share a copy of the CreativeML OpenRAIL-M license with all your users (please read the license in full).
Please read the full license carefully.