🚀 Stable Diffusion v1-4 Model Card
Stable Diffusion is a latent text-to-image diffusion model that can generate photo-realistic images from any text input. For more details on how Stable Diffusion works, refer to 🤗's Stable Diffusion with 🧨Diffusers blog.
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 225k steps at a resolution of 512x512 on "laion-aesthetics v2 5+". Additionally, 10% of the text-conditioning was dropped to enhance classifier-free guidance sampling.
These weights are designed for use with the 🧨 Diffusers library. If you need the weights for the CompVis Stable Diffusion codebase, visit here.
✨ Features
- Latent Text-to-Image Generation: Capable of generating high-quality, photo-realistic images based on text prompts.
- Fine-Tuned Performance: Initialized from Stable-Diffusion-v1-2 and fine-tuned for better results.
- Multiple Scheduler Support: Allows users to swap out noise schedulers.
📦 Installation
We recommend using 🤗's Diffusers library to run Stable Diffusion.
PyTorch
pip install --upgrade diffusers transformers scipy
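If you also plan to run the JAX/Flax example further down, Flax needs to be installed as well; a command along these lines should work (the exact package list is an assumption, check the Diffusers documentation for the current requirements):
pip install --upgrade diffusers transformers flax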
💻 Usage Examples
Basic Usage
Running the pipeline with the default PNDM scheduler:
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
Advanced Usage
Memory-Saving on GPU
If you have less than 4GB of GPU RAM, load the StableDiffusionPipeline in float16 precision and enable attention slicing:
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
pipe.enable_attention_slicing()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
Swapping Noise Scheduler
To use the Euler scheduler instead:
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
model_id = "CompVis/stable-diffusion-v1-4"
# Use the Euler scheduler here instead
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
JAX/Flax
To use Stable Diffusion on TPUs and GPUs for faster inference:
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", revision="flax", dtype=jax.numpy.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50
num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)
# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
Note: If you are limited by TPU memory, load the FlaxStableDiffusionPipeline in bfloat16 precision instead of the default float32 precision:
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jax.numpy.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50
num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)
# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
📚 Documentation
Model Details
Property | Details |
---|---|
Developed by | Robin Rombach, Patrick Esser |
Model Type | Diffusion-based text-to-image generation model |
Language(s) | English |
License | The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work of BigScience and the RAIL Initiative in responsible AI licensing. See also the article about the BLOOM Open RAIL license on which our license is based. |
Model Description | This is a model for generating and modifying images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper. |
Resources for more information | GitHub Repository, Paper. |
Cite as | Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022 (BibTeX key: @InProceedings{Rombach_2022_CVPR}). |
Uses
Direct Use
The model is for research purposes only. Possible research areas and tasks include:
- Safe deployment of models with the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Misuse, Malicious Use, and Out-of-Scope Use
Note: This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies to Stable Diffusion v1 as well.
The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating disturbing, distressing, or offensive images, or content that propagates historical or current stereotypes.
Out-of-Scope Use
The model was not trained to provide factual or true representations of people or events. Using it to generate such content is beyond its capabilities.
Misuse and Malicious Use
Using the model to generate cruel content towards individuals is a misuse. This includes, but is not limited to:
- Generating demeaning, dehumanizing, or harmful representations of people, their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without consent.
- Generating sexual content without the consent of those who might see it.
- Spreading mis- and disinformation.
- Representing egregious violence and gore.
- Sharing copyrighted or licensed material in violation of its terms of use.
- Sharing altered copyrighted or licensed material in violation of its terms of use.
Limitations and Bias
Limitations
- The model does not achieve perfect photorealism.
- It cannot render legible text.
- It performs poorly on complex composition tasks, such as rendering “A red cube on top of a blue sphere”.
- Faces and people may not be generated properly.
- Trained mainly with English captions, it works less well in other languages.
- The autoencoding part of the model is lossy.
- Trained on [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult material and is unfit for product use without additional safety measures.
- The training dataset was not deduplicated, leading to some memorization of duplicated images. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to detect memorized images.
Bias
Image generation models can reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of [LAION-2B (en)](https://laion.ai/blog/laion-5b/), mainly with English descriptions. Texts and images from non-English communities and cultures are under-represented, affecting the overall output. White and western cultures are often the default, and the model performs worse with non-English prompts.
Safety Module
The model is intended to be used with the Safety Checker in Diffusers. This checker compares model outputs against known hard-coded NSFW concepts in the embedding space of the CLIPTextModel after image generation. The concepts are hidden to prevent reverse-engineering.
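In practice, images that the checker flags are returned blacked out, and the pipeline output carries a per-image flag. A minimal sketch, reusing the pipe and prompt from the PyTorch examples above; the nsfw_content_detected field is present in current Diffusers releases, treat it as an assumption for older versions:
result = pipe(prompt)
# With the safety checker enabled, result.nsfw_content_detected is a list of booleans,
# one per generated image; flagged images are returned as black images.
for i, (img, flagged) in enumerate(zip(result.images, result.nsfw_content_detected)):
    if not flagged:
        img.save(f"output_{i}.png")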
Training
Training Data
The model was trained on LAION-2B (en) and its subsets.
Training Procedure
Stable Diffusion v1-4 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the autoencoder's latent space.
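To make the latent-space idea concrete, the sketch below encodes a dummy 512x512 image with the VAE and shows that the diffusion model operates on a much smaller 4x64x64 latent. It reuses the pipe from the examples above; the shapes refer to Stable Diffusion v1, and the scaling_factor attribute is an assumption for older Diffusers versions:
import torch
with torch.no_grad():
    # A dummy 512x512 RGB image in the VAE's expected value range of roughly [-1, 1]
    dummy_image = torch.randn(1, 3, 512, 512, device=pipe.device, dtype=pipe.vae.dtype)
    latents = pipe.vae.encode(dummy_image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor  # about 0.18215 for Stable Diffusion v1
print(latents.shape)  # torch.Size([1, 4, 64, 64]) -- the space the diffusion model is trained in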
🔧 Technical Details
The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2) and fine-tuned for 225k steps at a resolution of 512x512 on "laion-aesthetics v2 5+". 10% of the text-conditioning was dropped to improve classifier-free guidance sampling.
📄 License
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage.
⚠️ Important Note
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- The authors claim no rights on the outputs you generate. You are free to use them but accountable for their use, which must not violate the license provisions.
- You may redistribute the weights and use the model commercially and/or as a service. If you do, include the same use restrictions as in the license and share a copy of the CreativeML OpenRAIL-M license with all your users.
Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license