🚀 Versatile Diffusion V1.0 Model Card
Versatile Diffusion (VD) is the first unified multi-flow multimodal diffusion framework, serving as a step towards Universal Generative AI. It natively supports image-to-text, image-variation, text-to-image, and text-variation, and it can be extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, and latent image-to-text-to-image editing. Future versions will support more modalities such as speech, music, video, and 3D.
For more information, visit GitHub and arXiv.
🚀 Quick Start
This README provides details about the Versatile Diffusion V1.0 model, including its features, usage, and cautions.
✨ Features
- Unified Multi-flow Framework: Natively supports multiple tasks such as image-to-text, image-variation, text-to-image, and text-variation.
- Extensible: Can be extended to other applications such as semantic-style disentanglement and image-text dual-guided generation.
- Future-proof: Future versions will support more modalities such as speech, music, video, and 3D.
📦 Installation
To use the model, you need to install the required libraries (the examples below use PyTorch, 🧨Diffusers, and Transformers).
⚠️ Important Note
Make sure to install transformers from "main" in order to use this model:
```bash
pip install git+https://github.com/huggingface/transformers
```
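As a purely illustrative sanity check (not part of the original instructions), you can confirm that the libraries used in the examples below import correctly:

```python
# Illustrative check only: verify the libraries used in the examples are installed
import torch
import diffusers
import transformers

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```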
💻 Usage Examples
Basic Usage
You can use the model with the 🧨Diffusers library or the SHI-Labs Versatile Diffusion codebase.
To use Versatile Diffusion for all tasks, it is recommended to use the VersatileDiffusionPipeline:
```python
from diffusers import VersatileDiffusionPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# text prompt
prompt = "a red car"

# initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# text to image
image = pipe.text_to_image(prompt).images[0]
# image variation
image = pipe.image_variation(image).images[0]
# dual-guided generation (text + image)
image = pipe.dual_guided(prompt, image).images[0]
```
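If you want to keep each intermediate result instead of overwriting `image`, you can hold the outputs in separate variables and save them. This is a minimal sketch reusing the `pipe`, `prompt`, and `image` defined above; the variable and file names are arbitrary:

```python
# Reuse pipe, prompt, and the initial image from the snippet above; names are arbitrary
text2img = pipe.text_to_image(prompt).images[0]
variation = pipe.image_variation(text2img).images[0]
dual = pipe.dual_guided(prompt, text2img).images[0]

text2img.save("./text_to_image.png")
variation.save("./image_variation.png")
dual.save("./dual_guided.png")
```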
Advanced Usage
Task Specific Pipelines
The task-specific pipelines load only the weights that are needed onto the GPU. You can find all task-specific pipelines here.
Text to Image
```python
from diffusers import VersatileDiffusionTextToImagePipeline
import torch

pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
# drop the weights that are not needed for text-to-image
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("an astronaut riding on a horse on mars", generator=generator).images[0]
image.save("./astronaut.png")
```
Image variations
```python
from diffusers import VersatileDiffusionImageVariationPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download an initial image to generate variations of
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(image, generator=generator).images[0]
image.save("./car_variation.png")
```
Dual-guided generation
```python
from diffusers import VersatileDiffusionDualGuidedPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download a reference image and define a text prompt
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
text = "a red car in the sun"

pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)

# text_to_image_strength balances the two conditions: higher values weight the
# text prompt more heavily relative to the reference image
text_to_image_strength = 0.75
image = pipe(prompt=text, image=image, text_to_image_strength=text_to_image_strength, generator=generator).images[0]
image.save("./red_car.png")
```
Original GitHub Repository
Follow the instructions here.
📚 Documentation
Model Details
One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is shown in the following diagram:
| Property | Details |
|----------|---------|
| Developed by | Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi |
| Model Type | Diffusion-based multimodal generation model |
| Language(s) | English |
| License | MIT |
| Resources for more information | GitHub Repository, Paper |
| Cite as | @article{xu2022versatile, title = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model}, author = {Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi}, year = 2022, url = {https://arxiv.org/abs/2211.08332}, eprint = {2211.08332}, archiveprefix = {arXiv}, primaryclass = {cs.CV}} |
🔧 Technical Details
One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, which enables it to handle one specific task under a certain data type and context type. The multi-flow structure allows it to support multiple tasks simultaneously.
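As a rough way to see this structure from the 🧨Diffusers side, you can list the sub-models the pipeline is composed of (the VAE, the UNet diffuser, and the CLIP text/image context encoders). This is a minimal sketch; the exact component names depend on the Diffusers version:

```python
# Minimal sketch: list the sub-models that make up the Versatile Diffusion pipeline.
# Component names may vary across Diffusers versions.
from diffusers import VersatileDiffusionPipeline
import torch

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)

for name, component in pipe.components.items():
    print(f"{name}: {type(component).__name__}")
```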
📄 License
This model is licensed under the MIT license.
Cautions, Biases, and Content Acknowledgment
⚠️ Important Note
We would like to make users of this demo aware of its potential issues and concerns. Like previous large foundation models, Versatile Diffusion can be problematic in some cases, partially due to imperfect training data and pretrained networks (VAEs / context encoders) with limited scope. In its future research phase, VD may do better on tasks such as text-to-image, image-to-text, etc., with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. So far, we have kept all features available for research testing, both to show the great potential of the VD framework and to collect important feedback for improving the model in the future. We welcome researchers and users to report issues via the Hugging Face community discussion feature or by emailing the authors.
Beware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which scraped non-curated online images and text; although we removed illegal content, the data may still contain unintended exceptions. VD in this demo is meant only for research purposes.