🚀 Versatile Diffusion V1.0 Model Card
Versatile Diffusion (VD) is the first unified multi-flow multimodal diffusion framework, serving as a step towards Universal Generative AI. It natively supports image-to-text, image-variation, text-to-image, and text-variation, and it can be extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, and latent image-to-text-to-image editing. Future versions will support more modalities such as speech, music, video, and 3D.
For more information, visit GitHub and arXiv.
🚀 Quick Start
This README provides details about the Versatile Diffusion V1.0 model, including its features, usage, and cautions.
✨ Features
- Unified Multi-flow Framework: Natively supports multiple tasks such as image-to-text, image-variation, text-to-image, and text-variation.
- Extensible: Can be extended to other applications such as semantic-style disentanglement and image-text dual-guided generation.
- Future-proof: Future versions will support more modalities such as speech, music, video, and 3D.
📦 Installation
To use the model, you need to install the required libraries (the examples below use PyTorch, 🧨Diffusers, and Transformers).
⚠️ Important Note
Make sure to install transformers from "main" in order to use this model:
```bash
pip install git+https://github.com/huggingface/transformers
```
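As a purely illustrative sanity check (not part of the original instructions), you can confirm that the libraries used in the examples below import correctly:

```python
# Illustrative check only: verify the libraries used in the examples are installed
import torch
import diffusers
import transformers

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```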
💻 Usage Examples
Basic Usage
You can use the model with the 🧨Diffusers library or the SHI-Labs Versatile Diffusion codebase.
To use Versatile Diffusion for all tasks, it is recommended to use the VersatileDiffusionPipeline:
```python
from diffusers import VersatileDiffusionPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# text prompt
prompt = "a red car"

# initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# text to image
image = pipe.text_to_image(prompt).images[0]
# image variation
image = pipe.image_variation(image).images[0]
# dual-guided generation (text + image)
image = pipe.dual_guided(prompt, image).images[0]
```
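If you want to keep each intermediate result instead of overwriting `image`, you can hold the outputs in separate variables and save them. This is a minimal sketch reusing the `pipe`, `prompt`, and `image` defined above; the variable and file names are arbitrary:

```python
# Reuse pipe, prompt, and the initial image from the snippet above; names are arbitrary
text2img = pipe.text_to_image(prompt).images[0]
variation = pipe.image_variation(text2img).images[0]
dual = pipe.dual_guided(prompt, text2img).images[0]

text2img.save("./text_to_image.png")
variation.save("./image_variation.png")
dual.save("./dual_guided.png")
```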
Advanced Usage
Task Specific Pipelines
The task-specific pipelines load only the weights that are needed onto the GPU. You can find all task-specific pipelines here.
Text to Image
```python
from diffusers import VersatileDiffusionTextToImagePipeline
import torch

pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
# drop the weights that are not needed for text-to-image
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("an astronaut riding on a horse on mars", generator=generator).images[0]
image.save("./astronaut.png")
```
Image variations
```python
from diffusers import VersatileDiffusionImageVariationPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download an initial image to generate variations of
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(image, generator=generator).images[0]
image.save("./car_variation.png")
```
Dual-guided generation
```python
from diffusers import VersatileDiffusionDualGuidedPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download a reference image and define a text prompt
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
text = "a red car in the sun"

pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)

# text_to_image_strength balances the two conditions: higher values weight the
# text prompt more heavily relative to the reference image
text_to_image_strength = 0.75
image = pipe(prompt=text, image=image, text_to_image_strength=text_to_image_strength, generator=generator).images[0]
image.save("./red_car.png")
```
Original GitHub Repository
Follow the instructions here.
📚 Documentation
Model Details
One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is shown in the following diagram:
| Property | Details |
|----------|---------|
| Developed by | Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi |
| Model Type | Diffusion-based multimodal generation model |
| Language(s) | English |
| License | MIT |
| Resources for more information | GitHub Repository, Paper |
| Cite as | @article{xu2022versatile, title = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model}, author = {Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi}, year = 2022, url = {https://arxiv.org/abs/2211.08332}, eprint = {2211.08332}, archiveprefix = {arXiv}, primaryclass = {cs.CV}} |
🔧 Technical Details
One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, which enables it to handle one specific task under a certain data type and context type. The multi-flow structure allows it to support multiple tasks simultaneously.
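As a rough way to see this structure from the 🧨Diffusers side, you can list the sub-models the pipeline is composed of (the VAE, the UNet diffuser, and the CLIP text/image context encoders). This is a minimal sketch; the exact component names depend on the Diffusers version:

```python
# Minimal sketch: list the sub-models that make up the Versatile Diffusion pipeline.
# Component names may vary across Diffusers versions.
from diffusers import VersatileDiffusionPipeline
import torch

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)

for name, component in pipe.components.items():
    print(f"{name}: {type(component).__name__}")
```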
📄 License
This model is licensed under the MIT license.
Cautions, Biases, and Content Acknowledgment
⚠️ Important Note
We would like to make users of this demo aware of its potential issues and concerns. Like previous large foundation models, Versatile Diffusion can be problematic in some cases, partially due to imperfect training data and pretrained networks (VAEs / context encoders) with limited scope. In its future research phase, VD may do better on tasks such as text-to-image, image-to-text, etc., with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. So far, we have kept all features available for research testing, both to show the great potential of the VD framework and to collect important feedback for improving the model in the future. We welcome researchers and users to report issues via the Hugging Face community discussion feature or by emailing the authors.
Beware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which scraped non-curated online images and text; although we removed illegal content, the data may still contain unintended exceptions. VD in this demo is meant only for research purposes.