Versatile Diffusion: an open-source model for converting between and editing images and text

Versatile Diffusion

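Developed by shi-labs

The first unified multi-flow multimodal diffusion framework, supporting conversion between images and text as well as editing of both

Task: text-to-image. License: MIT. Tags: #multimodal diffusion framework #bidirectional image-text generation #disentangled editing

Downloads: 8,455

Released: 11/22/2022

Model overview

Versatile Diffusion (VD) is a multimodal generative model that natively supports image-to-text, image variation, text-to-image, and text variation, and can be extended to applications such as semantic-style disentanglement and image-text dual-guided generation.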

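Model features

Unified multimodal framework

The first unified diffusion framework to support conversion in both directions between images and text, as well as editing

Multi-flow structure

Handles tasks across modalities through composable flow modules

Extensibility

Extends to advanced applications such as semantic-style disentanglement and dual-guided generation

Model capabilities

Text-to-image generation

Image variation generation

Image captioning (image-to-text)

Dual-guided generation from image and text

Latent-space editing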

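Use cases

Creative design

Concept art generation

Generate science-fiction scenes from a text description (for example, "an astronaut riding on a horse on mars")

Produces creative images that match the semantics of the prompt

Image editing

Style transfer

Change the style of an image through dual-guided generation (for example, steering a photo of an ordinary car with the prompt "a red car in the sun")

Produces stylized output that preserves the original content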

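🚀 Versatile Diffusion V1.0: project overview

Versatile Diffusion (VD) is the first unified multi-flow multimodal diffusion framework, and a step towards universal generative AI. It natively supports image-to-text, image variation, text-to-image, and text variation, and can be further extended to applications such as semantic-style disentanglement, image-text dual-guided generation, and latent image-to-text-to-image editing. Future versions will support more modalities, such as speech, music, video, and 3D.

For more information, see GitHub and arXiv.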

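✨ Key features

The first unified multi-flow multimodal diffusion framework, a step towards universal generative AI.
Native support for image-to-text, image variation, text-to-image, and text variation.
Extensible to semantic-style disentanglement, image-text dual-guided generation, and latent image-to-text-to-image editing.
Future versions will support more modalities, such as speech, music, video, and 3D.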

📦 Installation

Using the Diffusers library

To use this model, make sure to install transformers from its "main" branch:

pip install git+https://github.com/huggingface/transformers
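
Before loading a pipeline, it can help to confirm which versions of the required packages are actually installed. The following is a minimal sketch that uses only the Python standard library. The helper name `installed_version` is hypothetical, and the package list simply mirrors the packages named in the pip comment of the basic usage example (transformers, diffusers, torch).

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the packages the usage examples depend on.
    for name in ("transformers", "diffusers", "torch"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

A source install of transformers usually reports a development version string, which is one quick way to tell it apart from a release installed from PyPI.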

💻 Usage examples

Basic usage

Use VersatileDiffusionPipeline to call the different tasks from a single pipeline:

#! pip install git+https://github.com/huggingface/transformers diffusers torch
from diffusers import VersatileDiffusionPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# prompt
prompt = "a red car"

# download the initial image used as the image condition
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")

# text to image
text_image = pipe.text_to_image(prompt).images[0]

# image variation of the downloaded image
variation_image = pipe.image_variation(init_image).images[0]

# dual guided: condition on both the prompt and the downloaded image
dual_image = pipe.dual_guided(prompt, init_image).images[0]
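
The examples in this section assume a CUDA GPU and load the weights in float16. As a hedged sketch, the hypothetical helper below picks a device and a dtype name at runtime and falls back to CPU with float32 when torch or CUDA is unavailable; half precision on CPU is often slow or unsupported for some operations, which is why the fallback uses float32. It returns plain strings so that it runs even when torch is not installed.

```python
def pick_device_and_dtype():
    """Return (device, dtype_name): ("cuda", "float16") with a GPU, else ("cpu", "float32")."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(device, dtype_name)
```

With torch installed, `getattr(torch, dtype_name)` gives the value to pass as `torch_dtype`, and `pipe.to(device)` can replace the hard-coded `"cuda"`.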

Advanced usage

Text-to-image

from diffusers import VersatileDiffusionTextToImagePipeline
import torch

pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("an astronaut riding on a horse on mars", generator=generator).images[0]
image.save("./astronaut.png")

Image variation

from diffusers import VersatileDiffusionImageVariationPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download an initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(image, generator=generator).images[0]
image.save("./car_variation.png")

Dual-guided generation

from diffusers import VersatileDiffusionDualGuidedPipeline
import torch
import requests
from io import BytesIO
from PIL import Image

# download an initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"

response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
text = "a red car in the sun"

pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
text_to_image_strength = 0.75

image = pipe(prompt=text, image=image, text_to_image_strength=text_to_image_strength, generator=generator).images[0]
image.save("./red_car.png")
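
The `text_to_image_strength` argument balances the text prompt against the input image. One way to get a feel for it is to generate with several values and compare the results. The sketch below only builds the list of values and output filenames; the helper name `strength_sweep` and the filename pattern are hypothetical, and it assumes the useful range is 0 to 1. How the pipeline interprets the two endpoints should be checked against the diffusers documentation.

```python
def strength_sweep(steps: int):
    """Return `steps` evenly spaced strengths from 0.0 to 1.0 inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [round(i / (steps - 1), 4) for i in range(steps)]


# Pair each strength with a hypothetical output filename.
plan = [(s, f"red_car_strength_{s:.2f}.png") for s in strength_sweep(5)]
for strength, filename in plan:
    print(strength, filename)
```

Each pair could then be passed to the dual-guided call as `text_to_image_strength=strength`, re-creating the seeded generator before every call so that only the strength changes between images.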

Using the original GitHub repository

Follow the instructions in the original GitHub repository.

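📚 Documentation

Model details

A single flow of Versatile Diffusion contains a variational autoencoder (VAE), a diffuser, and a context encoder, and therefore handles one task (for example, text-to-image) under one data type (for example, image) and one context type (for example, text). The multi-flow structure of Versatile Diffusion is shown in the following figure:

Property	Details
Developed by	Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi
Model type	Diffusion-based multimodal generation model
Language	English
License	MIT
Resources for more information	GitHub repository, paper
Cite as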

      @article{xu2022versatile,
      	title        = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model},
      	author       = {Xingqian Xu and Zhangyang Wang and Eric Zhang and Kai Wang and Humphrey Shi},
      	year         = 2022,
      	url          = {https://arxiv.org/abs/2211.08332},
      	eprint       = {2211.08332},
      	archiveprefix = {arXiv},
      	primaryclass = {cs.CV}
      }

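🔧 Technical details

A single flow of Versatile Diffusion consists of a VAE, a diffuser, and a context encoder, and handles one task under one data type and one context type. Combining several such flows in the multi-flow structure is what lets the model support multiple tasks and applications.

📄 License

This project is released under the MIT License.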
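
As a rough illustration of this idea, the four natively supported tasks can be read as combinations of one context type and one output data type. The sketch below is a conceptual model only, not the actual Versatile Diffusion code; the class, field, and constant names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow pairs a context type with a data type (conceptual sketch)."""
    task: str
    context_type: str  # modality of the conditioning input
    data_type: str     # modality of the generated output


MODALITIES = ("image", "text")

# Task names keyed by (context type, data type).
TASK_NAMES = {
    ("text", "image"): "text-to-image",
    ("image", "image"): "image variation",
    ("image", "text"): "image-to-text",
    ("text", "text"): "text variation",
}

# Every (context, data) pair yields one flow, so 2 modalities give 4 tasks.
FLOWS = [
    Flow(TASK_NAMES[(ctx, data)], ctx, data)
    for ctx in MODALITIES
    for data in MODALITIES
]

for flow in FLOWS:
    print(f"{flow.task}: context={flow.context_type}, output={flow.data_type}")
```

Seen this way, adding a modality means adding new context and data types rather than redesigning the whole model, which matches the extensibility the description emphasizes.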

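⚠️ Important notice

We would like users of this demo to be aware of its potential issues and concerns. Like earlier large foundation models, Versatile Diffusion can be problematic in some cases, partly because of imperfect training data and the limited scope of its pretrained networks (the VAE and the context encoders). In future research phases, VD may perform better on tasks such as text-to-image and image-to-text with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. For now, we keep all features available for research testing, both to show the potential of the Versatile Diffusion framework and to collect feedback for improving the model. We welcome researchers and users to report issues through the HuggingFace community discussion feature or by emailing the authors.

Please note that Versatile Diffusion may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. The model was trained on the LAION-2B dataset, which scraped uncurated online images and text; although we removed illegal content, it may still contain unintended exceptions. The Versatile Diffusion model in this demo is intended for research purposes only.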