🚀 Stable Diffusion v1-4 Model Card
Stable Diffusion is a latent text-to-image diffusion model that can generate photo-realistic images from any text input. For more details on how Stable Diffusion works, refer to 🤗's Stable Diffusion with 🧨Diffusers blog.
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 225k steps at a resolution of 512x512 on "laion-aesthetics v2 5+". Additionally, 10% of the text-conditioning was dropped to enhance classifier-free guidance sampling.
These weights are designed for use with the 🧨 Diffusers library. If you need the weights for the CompVis Stable Diffusion codebase, visit here.
✨ Features
- Latent Text-to-Image Generation: Capable of generating high-quality, photo-realistic images based on text prompts.
- Fine-Tuned Performance: Initialized from Stable-Diffusion-v1-2 and fine-tuned for better results.
- Multiple Scheduler Support: Allows users to swap out noise schedulers.
📦 Installation
We recommend using 🤗's Diffusers library to run Stable Diffusion.
PyTorch
pip install --upgrade diffusers transformers scipy
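If you also plan to run the JAX/Flax example further down, Flax needs to be installed as well; a command along these lines should work (the exact package list is an assumption, check the Diffusers documentation for the current requirements):
pip install --upgrade diffusers transformers flax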
💻 Usage Examples
Basic Usage
Running the pipeline with the default PNDM scheduler:
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
Advanced Usage
Memory-Saving on GPU
If you have less than 4GB of GPU RAM, load the StableDiffusionPipeline in float16 precision and enable attention slicing:
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
pipe.enable_attention_slicing()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
Swapping Noise Scheduler
To use the Euler scheduler instead:
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
model_id = "CompVis/stable-diffusion-v1-4"
# Use the Euler scheduler here instead
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
JAX/Flax
To use Stable Diffusion on TPUs and GPUs for faster inference:
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", revision="flax", dtype=jax.numpy.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50
num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)
# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
Note: If you are limited by TPU memory, load the FlaxStableDiffusionPipeline in bfloat16 precision instead of the default float32 precision:
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jax.numpy.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50
num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)
# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
📚 Documentation
Model Details
Property | Details |
---|---|
Developed by | Robin Rombach, Patrick Esser |
Model Type | Diffusion-based text-to-image generation model |
Language(s) | English |
License | The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work of BigScience and the RAIL Initiative in responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. |
Model Description | This is a model for generating and modifying images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper. |
Resources for more information | GitHub Repository, Paper. |
Cite as | Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022 (BibTeX key: @InProceedings{Rombach_2022_CVPR}). |
Uses
Direct Use
The model is for research purposes only. Possible research areas and tasks include:
- Safe deployment of models with the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Misuse, Malicious Use, and Out-of-Scope Use
Note: This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies to Stable Diffusion v1 as well.
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating disturbing, distressing, or offensive images, or content that propagates historical or current stereotypes.
Out-of-Scope Use
The model was not trained to provide factual or true representations of people or events. Using it to generate such content is beyond its capabilities.
Misuse and Malicious Use
Using the model to generate cruel content towards individuals is a misuse. This includes, but is not limited to:
- Generating demeaning, dehumanizing, or harmful representations of people, their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without consent.
- Generating sexual content without the consent of those who might see it.
- Spreading mis- and disinformation.
- Representing egregious violence and gore.
- Sharing copyrighted or licensed material in violation of its terms of use.
- Sharing altered copyrighted or licensed material in violation of its terms of use.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism.
- It cannot render legible text.
- It performs poorly on complex composition tasks, such as rendering “A red cube on top of a blue sphere”.
- Faces and people may not be generated properly.
- Trained mainly with English captions, it works less well in other languages.
- The autoencoding part of the model is lossy.
- Trained on [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult material and is unfit for product use without additional safety measures.
- The training dataset was not deduplicated, leading to some memorization of duplicated images. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to detect memorized images.
Bias
Image generation models can reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of [LAION-2B (en)](https://laion.ai/blog/laion-5b/), mainly with English descriptions. Texts and images from non-English communities and cultures are under-represented, affecting the overall output. White and western cultures are often the default, and the model performs worse with non-English prompts.
Safety Module
The model is intended to be used with the Safety Checker in Diffusers. This checker compares model outputs against known hard-coded NSFW concepts in the embedding space of the CLIPTextModel after image generation. The concepts are hidden to prevent reverse-engineering.
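In practice, images that the checker flags are returned blacked out, and the pipeline output carries a per-image flag. A minimal sketch, reusing the pipe and prompt from the PyTorch examples above; the nsfw_content_detected field is present in current Diffusers releases, treat it as an assumption for older versions:
result = pipe(prompt)
# With the safety checker enabled, result.nsfw_content_detected is a list of booleans,
# one per generated image; flagged images are returned as black images.
for i, (img, flagged) in enumerate(zip(result.images, result.nsfw_content_detected)):
    if not flagged:
        img.save(f"output_{i}.png")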
Training
Training Data
The model was trained on LAION-2B (en) and its subsets.
Training Procedure
Stable Diffusion v1-4 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the autoencoder's latent space.
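To make the latent-space idea concrete, the sketch below encodes a dummy 512x512 image with the VAE and shows that the diffusion model operates on a much smaller 4x64x64 latent. It reuses the pipe from the examples above; the shapes refer to Stable Diffusion v1, and the scaling_factor attribute is an assumption for older Diffusers versions:
import torch
with torch.no_grad():
    # A dummy 512x512 RGB image in the VAE's expected value range of roughly [-1, 1]
    dummy_image = torch.randn(1, 3, 512, 512, device=pipe.device, dtype=pipe.vae.dtype)
    latents = pipe.vae.encode(dummy_image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor  # about 0.18215 for Stable Diffusion v1
print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- the space the diffusion model is trained in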
🔧 Technical Details
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2) and fine-tuned for 225k steps at a resolution of 512x512 on "laion-aesthetics v2 5+". 10% of the text-conditioning was dropped to improve classifier-free guidance sampling.
📄 License
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage.
⚠️ Important Note
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- The authors claim no rights on the outputs you generate. You are free to use them but accountable for their use, which must not violate the license provisions.
- You may redistribute the weights and use the model commercially and/or as a service. If you do, include the same use restrictions as in the license and share a copy of the CreativeML OpenRAIL-M license with all your users.
Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license