AltDiffusion-m9開源多語言圖像生成模型 - 支持9種語言文本轉圖像

Altdiffusion M9

由BAAI開發

AltDiffusion-m9是基於Stable Diffusion框架的多語言文本到圖像生成模型，支持9種語言，由AltCLIP-m9多語言CLIP模型提供支持。

文本生成圖像支持多種語言開源協議:Openrail #多語言文生圖 #跨語言對齊 #高保真生成

下載量 46

發布時間 : 11/18/2022

模型概述

AltDiffusion-m9是一個多語言文本到圖像生成模型，基於Stable Diffusion框架，採用AltCLIP-m9多語言CLIP模型，使用悟道數據集和LAION數據進行訓練。該模型在多語言對齊方面表現卓越，是目前開源領域最強的多語言文本到圖像模型之一。

模型特點

多語言支持

支持9種語言的文本到圖像生成，包括英語、中文、西班牙語等。

高質量圖像生成

在多語言對齊方面表現卓越，部分案例中展現出比原版Stable Diffusion更優的生成效果。

商業用途友好

允許商業用途及模型權重再分發，但須包含相同使用限制並向所有用戶提供許可證副本。

模型能力

文本到圖像生成

多語言文本理解

高質量圖像合成

使用案例

創意設計

角色設計

根據多語言文本描述生成角色圖像，如'黑暗精靈公主'。

生成具有詳細幻想風格的角色圖像。

場景設計

根據文本描述生成特定場景的圖像。

生成符合描述的詳細場景圖像。

藝術創作

數字繪畫

根據藝術家的描述生成數字繪畫作品。

生成具有藝術價值的數字繪畫。

🚀 AltDiffusion

AltDiffusion是一個多語言文本到圖像的擴散模型，支持多種語言，可生成高質量的圖像，在多語言對齊方面表現出色，保留了原版Stable Diffusion的大部分能力。

🚀 快速開始

Gradio使用

我們支持通過 Gradio Web UI 運行 AltDiffusion-m9：

模型權重下載

第一次運行AltDiffusion-m9模型時會自動從huggingface下載如下權重：

模型名稱 Model name	大小 Size	描述 Description
StableDiffusionSafetyChecker	1.13G	圖片的安全檢查器；Safety checker for image
AltDiffusion-m9	8.0G	支持英語(En)、中文(Zh)、西班牙語(Es)、法語(Fr)、俄語(Ru)、日語(Ja)、韓語(Ko)、阿拉伯語(Ar)和意大利語(It)
AltCLIP-m9	3.22G	支持英語(En)、中文(Zh)、西班牙語(Es)、法語(Fr)、俄語(Ru)、日語(Ja)、韓語(Ko)、阿拉伯語(Ar)和意大利語(It)

示例代碼運行

🧨Diffusers示例

AltDiffusion-m9 已被添加到 🧨Diffusers！我們的代碼示例已放到colab上，歡迎使用。您可以在此處查看文檔頁面。

以下示例將使用fast DPM調度程序生成圖像，在V100上耗時大約為2秒。

from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "黑暗精靈公主，非常詳細，幻想，非常詳細，數字繪畫，概念藝術，敏銳的焦點，插圖"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"

image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")

alt

Transformers示例

import os
import torch
import transformers
from transformers import BertPreTrainedModel
from transformers.models.clip.modeling_clip import CLIPPreTrainedModel
from transformers.models.xlm_roberta.tokenization_xlm_roberta import XLMRobertaTokenizer
from diffusers.schedulers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler
from diffusers import StableDiffusionPipeline
from transformers import BertPreTrainedModel,BertModel,BertConfig
import torch.nn as nn
import torch
from transformers.models.xlm_roberta.configuration_xlm_roberta import XLMRobertaConfig
from transformers import XLMRobertaModel
from transformers.activations import ACT2FN
from typing import Optional


class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2,project_dim=768,pooler_fn='cls',learn_encoder=False, **kwargs):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
        # self.learn_encoder = learn_encoder

class RobertaSeriesModelWithTransformation(BertPreTrainedModel):
    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
    base_model_prefix = 'roberta'
    config_class= XLMRobertaConfig
    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config)
        self.transformation = nn.Linear(config.hidden_size, config.project_dim)
        self.post_init()
        
    def get_text_embeds(self,bert_embeds,clip_embeds):
        return self.merge_head(torch.cat((bert_embeds,clip_embeds)))

    def set_tokenizer(self, tokenizer):
        self.tokenizer = tokenizer

    def forward(self, input_ids: Optional[torch.Tensor] = None) :
        attention_mask = (input_ids != self.tokenizer.pad_token_id).to(torch.int64)
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        
        projection_state = self.transformation(outputs.last_hidden_state)
        
        return (projection_state,)

model_path_encoder = "BAAI/RobertaSeriesModelWithTransformation"
model_path_diffusion = "BAAI/AltDiffusion-m9"
device = "cuda"

seed = 12345
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path_encoder, use_auth_token=True)
tokenizer.model_max_length = 77

text_encoder = RobertaSeriesModelWithTransformation.from_pretrained(model_path_encoder, use_auth_token=True)
text_encoder.set_tokenizer(tokenizer)
print("text encode loaded")
pipe = StableDiffusionPipeline.from_pretrained(model_path_diffusion,
                                               tokenizer=tokenizer,
                                               text_encoder=text_encoder,
                                               use_auth_token=True,
                                               )
print("diffusion pipeline loaded")
pipe = pipe.to(device)

prompt = "Thirty years old lee evans as a sad 19th century postman. detailed, soft focus, candle light, interesting lights, realistic, oil canvas, character concept art by munkácsy mihály, csók istván, john everett millais, henry meynell rheam, and da vinci"
with torch.no_grad():
    image = pipe(prompt, guidance_scale=7.5).images[0]  
    
image.save("3.png")

您可以在predict_generate_images函數里通過改變參數來調整設置，具體信息如下：

參數名 Parameter	類型 Type	描述 Description
prompt	str	提示文本; The prompt text
out_path	str	輸出路徑; The output path to save images
n_samples	int	輸出圖片數量; Number of images to be generate
skip_grid	bool	如果為True，會將所有圖片拼接在一起，輸出一張新的圖片; If set to true, image gridding step will be skipped
ddim_step	int	DDIM模型的步數; Number of steps in ddim model
plms	bool	如果為True，則會使用plms模型; If set to true, PLMS Sampler instead of DDIM Sampler will be applied
scale	float	這個值決定了文本在多大程度上影響生成的圖片，值越大影響力越強; This value determines how important the prompt incluences generate images
H	int	圖片的高度; Height of image
W	int	圖片的寬度; Width of image
C	int	圖片的channel數; Numeber of channels of generated images
seed	int	隨機種子; Random seed number

⚠️ 重要提示

模型推理要求一張至少10G以上的GPU。

✨ 主要特性

多語言支持：支持英語(En)、中文(Zh)、西班牙語(Es)、法語(Fr)、俄語(Ru)、日語(Ja)、韓語(Ko)、阿拉伯語(Ar)和意大利語(It)等多種語言。
多模態任務：可用於多模態任務，在多語言對齊方面表現出色。
強大性能：保留了原版Stable Diffusion的大部分能力，在某些例子上比原版模型更出色。

📦 安裝指南

First you should install diffusers main branch and some dependencies:

pip install git+https://github.com/huggingface/diffusers.git torch transformers accelerate sentencepiece

📚 詳細文檔

模型信息

我們使用 AltCLIP-m9，基於 Stable Diffusion 訓練了雙語Diffusion模型，訓練數據來自 WuDao數據集和 LAION 。

我們的版本在多語言對齊方面表現非常出色，是目前市面上開源的最強多語言版本，保留了原版stable diffusion的大部分能力，並且在某些例子上比有著比原版模型更出色的能力。

AltDiffusion-m9 模型由名為 AltCLIP-m9 的多語 CLIP 模型支持，該模型也可在本項目中訪問。您可以閱讀此教程瞭解更多信息。

模型參數量

模塊名稱 Module Name	參數量 Number of Parameters
AutoEncoder	83.7M
Unet	865M
AltCLIP-m9 TextEncoder	859M

🔧 技術細節

關於AltCLIP-m9，我們已經推出了相關報告，有更多細節可以查閱，如對您的工作有幫助，歡迎引用。

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Please cite our paper if you find it helpful :)

@misc{ye2023altdiffusion,
      title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model}, 
      author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
      year={2023},
      eprint={2308.09991},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 許可證

該模型通過 CreativeML Open RAIL-M license 獲得許可。作者對您生成的輸出不主張任何權利，您可以自由使用它們並對它們的使用負責，不得違反本許可中的規定。該許可證禁止您分享任何違反任何法律、對他人造成傷害、傳播任何可能造成傷害的個人信息、傳播錯誤信息和針對弱勢群體的任何內容。您可以出於商業目的修改和使用模型，但必須包含相同使用限制的副本。有關限制的完整列表，請閱讀許可證。

模型基本信息

名稱 Name	任務 Task	語言 Language(s)	模型 Model	Github
AltDiffusion-m9	多模態 Multimodal	Multilingual	Stable Diffusion	FlagAI

Altdiffusion M9

模型概述

模型特點

模型能力

使用案例

🚀 AltDiffusion

🚀 快速開始

Gradio使用

模型權重下載

示例代碼運行

🧨Diffusers示例

Transformers示例

✨ 主要特性

📦 安裝指南

📚 詳細文檔

模型信息

模型參數量

🔧 技術細節

📄 許可證

模型基本信息

更多生成結果

多語言示例

中英文對齊能力

prompt:dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap

英文生成結果/Generated results from English prompts

prompt:黑暗精靈公主，非常詳細，幻想，非常詳細，數字繪畫，概念藝術，敏銳的焦點，插圖

中文生成結果/Generated results from Chinese prompts

中文表現能力

prompt:帶墨鏡的男孩肖像，充滿細節，8K高清

prompt:帶墨鏡的中國男孩肖像，充滿細節，8K高清

長圖生成能力

prompt: 一隻帶著帽子的小狗

原版 stable diffusion：

Ours: