🚀 Stable Diffusion v1-4 Model Card
Stable Diffusion is a latent text-to-image diffusion model that can generate photo-realistic images from any text input. For more details about how Stable Diffusion works, refer to 🤗's Stable Diffusion with 🧨 Diffusers blog.
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2) checkpoint and then fine-tuned for 225k steps at a resolution of 512x512 on "laion-aesthetics v2 5+", dropping the text-conditioning 10% of the time to improve classifier-free guidance sampling.
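To illustrate what classifier-free guidance sampling does with the two kinds of prediction this training enables, here is a minimal sketch in plain Python. It assumes the commonly used formulation in which the unconditional prediction is pushed toward the text-conditioned one by a guidance scale. The function name and the list-of-floats representation are illustrative only and are not part of Diffusers.

```python
def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Assumed formulation: uncond + scale * (text - uncond).
    A scale of 1.0 returns the text-conditioned prediction unchanged;
    larger values push the result further toward the prompt.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy values standing in for a flattened noise prediction.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
print(guided)
```

In the Diffusers pipeline this trade-off is exposed through the `guidance_scale` argument of the pipeline call, so the combination itself does not need to be written by hand.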
These weights are designed to be used with the 🧨 Diffusers library. If you need the weights for the CompVis Stable Diffusion codebase, [click here](https://huggingface.co/CompVis/stable-diffusion-v-1-4-original).
✨ Features
- Capable of generating photo-realistic images from text prompts.
- Fine-tuned to improve classifier-free guidance sampling.
- Can be used with different noise schedulers.
- Supports both PyTorch and JAX/Flax for inference.
📦 Installation
We recommend using 🤗's Diffusers library to run Stable Diffusion.
PyTorch
pip install --upgrade diffusers transformers scipy
💻 Usage Examples
Basic Usage
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
Advanced Usage
Using a Different Noise Scheduler
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
model_id = "CompVis/stable-diffusion-v1-4"
# Use the Euler scheduler here instead
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
JAX/Flax
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", revision="flax", dtype=jax.numpy.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50
num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)
# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
📚 Documentation
Model Details
Property | Details |
---|---|
Developed by | Robin Rombach, Patrick Esser |
Model Type | Diffusion-based text-to-image generation model |
Language(s) | English |
License | [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that BigScience and the RAIL Initiative are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based. |
Model Description | This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14) as suggested in the Imagen paper. |
Resources for more information | [GitHub Repository](https://github.com/CompVis/stable-diffusion), Paper. |
Cite as | @InProceedings{Rombach_2022_CVPR, |
Uses
Direct Use
The model is intended for research purposes only. Possible research areas and tasks include:
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Misuse, Malicious Use, and Out-of-Scope Use
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy.
- The model was trained on a large-scale dataset, [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult material and is not fit for product use without additional safety mechanisms and considerations.
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
Bias
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
Safety Module
The intended use of this model is with the Safety Checker in Diffusers. This checker works by checking model outputs against known hard-coded NSFW concepts. The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter. Specifically, the checker compares the class probability of harmful concepts in the embedding space of the CLIPTextModel after generation of the images. The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.
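The comparison described above can be pictured with a small sketch. This is not the Diffusers Safety Checker implementation. It assumes that the image and each hidden concept are represented as embedding vectors, that cosine similarity serves as the score, and that the hand-engineered weight acts as a per-concept threshold. All names and values below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_flagged(image_embedding, concept_embeddings, concept_thresholds):
    """Return True if the image scores above the threshold of any concept.

    concept_thresholds holds one hand-tuned value per concept
    (hypothetical stand-ins for the hand-engineered weights).
    """
    for concept, threshold in zip(concept_embeddings, concept_thresholds):
        if cosine_similarity(image_embedding, concept) > threshold:
            return True
    return False


# Toy 2-D embeddings: one concept pointing along the first axis.
concepts = [[1.0, 0.0]]
thresholds = [0.5]
print(is_flagged([1.0, 0.0], concepts, thresholds))  # aligned with the concept
print(is_flagged([0.0, 1.0], concepts, thresholds))  # orthogonal to the concept
```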
🔧 Technical Details
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2) checkpoint and subsequently fine-tuned for 225k steps at resolution 512x512 on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
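As a rough illustration of what dropping the text-conditioning 10% of the time can look like in a training data pipeline, the sketch below replaces each caption with an empty string with probability 0.1. The use of an empty string as the "unconditional" caption, the function name, and the per-sample independent draw are assumptions made for illustration. The source does not describe the actual training code.

```python
import random


def drop_text_conditioning(captions, drop_prob=0.1, rng=None):
    """Randomly blank out captions so the model also learns an
    unconditional prediction (assumed mechanism, see lead-in)."""
    rng = rng or random.Random()
    return ["" if rng.random() < drop_prob else caption for caption in captions]


# Seeded generator so the illustration is repeatable.
rng = random.Random(0)
batch = ["a photo of an astronaut riding a horse on mars"] * 10_000
dropped = drop_text_conditioning(batch, drop_prob=0.1, rng=rng)
fraction_dropped = sum(1 for c in dropped if c == "") / len(dropped)
print(f"fraction of captions dropped: {fraction_dropped:.3f}")
```

Seeing both captioned and uncaptioned samples during training is what lets a single model supply the conditional and unconditional predictions that classifier-free guidance combines at sampling time.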
📄 License
The model is licensed under [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license).
⚠️ Important Note
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- The authors claim no rights on the outputs you generate. You are free to use them and are accountable for their use, which must not go against the provisions set in the license.
- You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware that you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M with all your users. Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
💡 Usage Tip
If you are limited by GPU memory and have less than 4GB of GPU RAM available, please make sure to load the StableDiffusionPipeline in float16 precision instead of the default float32 precision. You can do so by telling diffusers to expect the weights to be in float16 precision. If you are limited by TPU memory, please make sure to load the FlaxStableDiffusionPipeline in bfloat16 precision instead of the default float32 precision. You can do so by telling diffusers to load the weights from the "bf16" branch.
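The reason half precision helps is simple arithmetic: float16 and bfloat16 store each weight in 2 bytes instead of the 4 bytes used by float32. The sketch below estimates weight storage only (activations and other runtime buffers are not counted). The helper name and the parameter count are hypothetical round numbers, not a statement about this model's actual size.

```python
# Bytes needed to store one element of each dtype.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_parameters, dtype):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_parameters * BYTES_PER_ELEMENT[dtype] / 1024**3


# Hypothetical parameter count chosen to give round numbers.
num_parameters = 2**30
for dtype in ("float32", "float16", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(num_parameters, dtype):.1f} GiB")
```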