🚀 Stable Diffusion v1 Model Card
Stable Diffusion is a latent text-to-image diffusion model. It can generate photo-realistic images from any text input, offering high-quality image generation capabilities for various applications.
🚀 Quick Start
The Stable-Diffusion-v1-3 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 195,000 steps at resolution `512x512` on "laion-improved-aesthetics", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
Download the weights
These weights are intended for use with the original CompVis Stable Diffusion codebase. If you're looking for the model to use with the 🧨 Diffusers library, see the Diffusers version of these weights.
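For reference, loading the model through Diffusers looks roughly like the sketch below. It assumes the v1-3 weights are published on the Hugging Face Hub as `CompVis/stable-diffusion-v1-3` and that a CUDA GPU is available; adjust both to your setup:

```python
# Minimal sketch: text-to-image with the Diffusers pipeline.
# Assumes the v1-3 weights live at "CompVis/stable-diffusion-v1-3" on the Hub.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3",  # assumed Hub model ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # use "cpu" (and float32) if no GPU is available

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]  # classifier-free guidance is applied by default
image.save("astronaut_rides_horse.png")
```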
✨ Features
- Text-to-Image Generation: Capable of generating photo-realistic images from text prompts.
- Fine-Tuned Checkpoints: Different checkpoints are available, such as `sd-v1-1.ckpt`, `sd-v1-2.ckpt`, and `sd-v1-3.ckpt`, each with specific training procedures.
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | Robin Rombach, Patrick Esser |
| Model Type | Diffusion-based text-to-image generation model |
| Language(s) | English |
| License | The CreativeML OpenRAIL M license, an Open RAIL M license, adapted from the work of BigScience and the RAIL Initiative in responsible AI licensing. See also the article about the BLOOM Open RAIL license on which this license is based. |
| Model Description | A model for generating and modifying images based on text prompts. It is a Latent Diffusion Model using a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper. |
| Resources for more information | GitHub Repository, Paper |
| Cite as | Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR 2022 (`@InProceedings{Rombach_2022_CVPR, ...}`) |
Uses
Direct Use
The model is for research purposes only. Possible research areas and tasks include:
- Safe deployment of models with the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Misuse, Malicious Use, and Out-of-Scope Use
⚠️ Important Note
This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies equally to Stable Diffusion v1.
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating disturbing, distressing, or offensive images, or content that propagates stereotypes.
- Out-of-Scope Use: The model was not trained to provide factual or true representations of people or events. Using it for such purposes is beyond its capabilities.
- Misuse and Malicious Use: Using the model to generate cruel content towards individuals is a misuse. This includes generating demeaning, discriminatory, or otherwise harmful representations, impersonating individuals without consent, generating non-consensual sexual content, spreading mis- and disinformation, representing egregious violence and gore, and sharing copyrighted or licensed material in violation of its terms of use.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism.
- It cannot render legible text.
- It performs poorly on difficult tasks involving compositionality, like rendering an image of “A red cube on top of a blue sphere”.
- Faces and people may not be generated properly.
- Trained mainly with English captions, it works less well in other languages.
- The autoencoding part of the model is lossy.
- Trained on the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset, which contains adult material and is unfit for product use without additional safety measures.
- No deduplication measures were used on the dataset, resulting in some memorization of duplicated training images. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to detect memorized images.
Bias
While image generation models are impressive, they can reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/), mainly with English descriptions. Texts and images from non-English communities are likely under-represented, affecting the model's output, with white and western cultures often being the default. The model also performs significantly worse with non-English prompts.
Training
Training Data
The model was trained using the following dataset:
- LAION-2B (en) and its subsets.
Training Procedure
Stable Diffusion v1 is a latent diffusion model combining an autoencoder with a diffusion model trained in the autoencoder's latent space. During training:
- Images are encoded by an encoder into latent representations. The autoencoder has a relative downsampling factor of f = 8, mapping images of shape H x W x 3 to latents of shape H/f x W/f x 4.
- Text prompts are encoded by a ViT-L/14 text encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise added to the latent and the UNet's prediction.
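The training step implied by this list is straightforward to sketch. The following is illustrative only, written against Diffusers-style components (`vae`, `text_encoder`, `unet`, and `scheduler` are assumed to be a pretrained AutoencoderKL, CLIP text encoder, conditional UNet, and DDPM-style noise scheduler); it is not the original CompVis training code:

```python
# Illustrative sketch of one latent-diffusion training step.
import torch
import torch.nn.functional as F

def training_step(images, input_ids, vae, text_encoder, unet, scheduler):
    # 1. Encode images into the autoencoder's latent space (H/8 x W/8 x 4);
    #    0.18215 is the standard SD latent scaling factor.
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    # 2. Encode the prompt; the non-pooled token embeddings condition the UNet.
    text_emb = text_encoder(input_ids)[0]
    # 3. Add noise to the latents at a random diffusion timestep.
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # 4. The UNet predicts the added noise, conditioned on the text
    #    embeddings via cross-attention.
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    # 5. Reconstruction objective between the added noise and the prediction.
    return F.mse_loss(pred, noise)
```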
We currently offer three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt`, and `sd-v1-3.ckpt`, trained as follows:
- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en), followed by 194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`. 515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size `>= 512x512`, an estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`; the watermark estimate comes from the LAION-5B metadata, and the aesthetics score from an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

Training details:
- Hardware: 32 x 8 x A100 GPUs
- Optimizer: AdamW
- Gradient Accumulations: 2
- Batch: 32 x 8 x 2 x 4 = 2048
- Learning rate: warmed up to 0.0001 over the first 10,000 steps and then kept constant
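The learning-rate schedule in the list above amounts to a short warmup followed by a constant rate. A minimal sketch, assuming the warmup is linear (the card only says "warmed up"):

```python
# Hypothetical helper reproducing the schedule described above: warmup to
# 1e-4 over 10,000 steps, then constant. The linear shape is an assumption.
def lr_at(step: int, base_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    return base_lr * min(1.0, step / warmup_steps)
```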
Evaluation Results
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints:

Evaluated using 50 PLMS steps and 10,000 random prompts from the COCO2017 validation set at 512x512 resolution. Not optimized for FID scores.
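For context on those guidance scales: classifier-free guidance runs the UNet twice per denoising step, once with the text embeddings and once with an unconditional (empty-prompt) embedding, and extrapolates between the two noise predictions. The 10% text-conditioning dropout during training is what makes the unconditional prediction meaningful. A minimal sketch, with hypothetical function and argument names, using the Diffusers-style UNet call from the earlier training sketch:

```python
import torch

# Illustrative sketch of classifier-free guidance at sampling time.
@torch.no_grad()
def guided_noise_pred(unet, noisy_latents, t, text_emb, uncond_emb, guidance_scale):
    eps_cond = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    eps_uncond = unet(noisy_latents, t, encoder_hidden_states=uncond_emb).sample
    # guidance_scale = 1.0 recovers the purely conditional prediction; larger
    # values (e.g. the 1.5-8.0 range above) trade diversity for prompt fidelity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```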
Environmental Impact
Stable Diffusion v1 Estimated Emissions
Based on the provided information, we estimate the following CO2 emissions using the Machine Learning Impact calculator from Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
- Hardware Type: A100 PCIe 40GB
- Hours used: 150000
- Cloud Provider: AWS
- Compute Region: US-east
- Carbon Emitted (power consumption x time x carbon produced based on location of power grid): 11,250 kg CO2 eq.
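As a sanity check, the 11,250 kg figure is reproducible with a back-of-envelope calculation. The per-GPU power draw and grid carbon intensity below are assumptions chosen to match the calculator's output, not published numbers:

```python
# Back-of-envelope reproduction of the estimate above. The 250 W average
# draw per A100 PCIe 40GB and 0.3 kg CO2eq/kWh for US-east are assumptions.
gpu_hours = 150_000
power_kw = 0.25                             # assumed average draw per GPU
grid_intensity = 0.3                        # assumed kg CO2eq per kWh
energy_kwh = gpu_hours * power_kw           # 37,500 kWh
emissions_kg = energy_kwh * grid_intensity  # ≈ 11,250 kg CO2 eq.
print(f"{emissions_kg:,.0f} kg CO2 eq.")
```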
📄 License
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- The authors claim no rights on the outputs you generate. You are free to use them but accountable for their use, which must not violate the provisions of the license.
- You may redistribute the weights and use the model commercially and/or as a service. If you do, you must include the same use restrictions as in the license and share a copy of the CreativeML OpenRAIL-M license with all your users (please read the license in full).
Please read the full license carefully.