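AltDiffusion-m9: an open-source multilingual image generation model that supports text-to-image generation in nine languages

AltDiffusion-m9

Developed by BAAI

AltDiffusion-m9 is a multilingual text-to-image model built on the Stable Diffusion framework. It supports nine languages and is driven by the AltCLIP-m9 multilingual CLIP model.

Tags: text-to-image, multilingual, open source. License: OpenRAIL. #multilingual text-to-image #cross-lingual alignment #high-fidelity generation

Downloads: 46

Release date: 11/18/2022

Model overview

AltDiffusion-m9 is a multilingual text-to-image model based on the Stable Diffusion framework. It uses the AltCLIP-m9 multilingual CLIP model as its text encoder and was trained on the WuDao dataset and LAION data. The model performs well at cross-lingual alignment and is currently one of the strongest open-source multilingual text-to-image models.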

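Model features

Multilingual support

Supports text-to-image generation in nine languages, including English, Chinese, and Spanish.

High-quality image generation

Performs well at cross-lingual alignment and, in some cases, produces better results than the original Stable Diffusion.

Suitable for commercial use

Commercial use and redistribution of the model weights are permitted, provided that the same use restrictions are included and a copy of the license is given to every user.

Model capabilities

Text-to-image generation

Multilingual text understanding

High-quality image synthesis

Use cases

Creative design

Character design

Generate character images from multilingual text descriptions, for example "dark elf princess".

Result: detailed fantasy-style character images.

Scene design

Generate an image of a specific scene from a text description.

Result: detailed scene images that match the description.

Art creation

Digital painting

Generate digital paintings from an artist's description.

Result: digital paintings with artistic value.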

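🚀 AltDiffusion

AltDiffusion is a multilingual text-to-image diffusion model that generates high-quality images from prompts in multiple languages.

🚀 Quick start

Be sure to read the license before accessing this model.

The model is open access and available to everyone. The CreativeML OpenRAIL-M license further specifies rights and usage.

The CreativeML OpenRAIL-M license specifies the following:

You may not use the model to deliberately produce or share illegal or harmful outputs or content.
The authors claim no rights over the outputs you generate. You are free to use them, but you are accountable for that use, which must not violate the provisions of the license.
You may redistribute the weights and use the model commercially or offer it as a service. If you do, you must include the same use restrictions as in the license and share a copy of the CreativeML OpenRAIL-M license with all of your users (please read the full license carefully).

✨ Key features

Multilingual support: English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It).
High-quality image generation: produces high-quality images from prompts in any of the supported languages.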

📦 Installation

Using 🧨Diffusers

First, install the main branch of diffusers along with a few dependencies.

pip install git+https://github.com/huggingface/diffusers.git torch transformers accelerate sentencepiece
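After installing, it can help to confirm that each dependency is actually visible to the interpreter you plan to run. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is simply copied from the pip command above.

```python
from importlib import metadata

# Package names taken from the pip command above.
REQUIRED = ["diffusers", "torch", "transformers", "accelerate", "sentencepiece"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing should be installed before running the examples below.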

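💻 Usage examples

🧨Diffusers Example

AltDiffusion-m9 has been added to 🧨Diffusers!

Our code example is available on Colab.

See the Diffusers documentation page for details.

The following example generates an image with the fast DPM scheduler. Generation takes about 2 seconds on a V100.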

from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "黑暗精灵公主，非常详细，幻想，非常详细，数字绘画，概念艺术，敏锐的焦点，插图"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"

image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")


Transformers Example

import torch
import torch.nn as nn
from typing import Optional

from diffusers import StableDiffusionPipeline
from transformers import BertPreTrainedModel, XLMRobertaModel
from transformers.models.xlm_roberta.configuration_xlm_roberta import XLMRobertaConfig
from transformers.models.xlm_roberta.tokenization_xlm_roberta import XLMRobertaTokenizer


class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2,project_dim=768,pooler_fn='cls',learn_encoder=False, **kwargs):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
        # self.learn_encoder = learn_encoder

class RobertaSeriesModelWithTransformation(BertPreTrainedModel):
    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
    base_model_prefix = 'roberta'
    config_class= XLMRobertaConfig
    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config)
        self.transformation = nn.Linear(config.hidden_size, config.project_dim)
        self.post_init()
        
    def get_text_embeds(self,bert_embeds,clip_embeds):
        return self.merge_head(torch.cat((bert_embeds,clip_embeds)))

    def set_tokenizer(self, tokenizer):
        self.tokenizer = tokenizer

    def forward(self, input_ids: Optional[torch.Tensor] = None) :
        attention_mask = (input_ids != self.tokenizer.pad_token_id).to(torch.int64)
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        
        projection_state = self.transformation(outputs.last_hidden_state)
        
        return (projection_state,)

model_path_encoder = "BAAI/RobertaSeriesModelWithTransformation"
model_path_diffusion = "BAAI/AltDiffusion-m9"
device = "cuda"

seed = 12345
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path_encoder, use_auth_token=True)
tokenizer.model_max_length = 77

text_encoder = RobertaSeriesModelWithTransformation.from_pretrained(model_path_encoder, use_auth_token=True)
text_encoder.set_tokenizer(tokenizer)
print("text encode loaded")
pipe = StableDiffusionPipeline.from_pretrained(model_path_diffusion,
                                               tokenizer=tokenizer,
                                               text_encoder=text_encoder,
                                               use_auth_token=True,
                                               )
print("diffusion pipeline loaded")
pipe = pipe.to(device)

prompt = "Thirty years old lee evans as a sad 19th century postman. detailed, soft focus, candle light, interesting lights, realistic, oil canvas, character concept art by munkácsy mihály, csók istván, john everett millais, henry meynell rheam, and da vinci"
with torch.no_grad():
    image = pipe(prompt, guidance_scale=7.5).images[0]  
    
image.save("3.png")
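The `forward` method above derives its attention mask directly from the token ids: every position whose id differs from the pad token is attended to. The sketch below mirrors that rule in plain Python so it can be run without a GPU. The token ids are made up for illustration, and the pad id of 1 is the default from `RobertaSeriesConfig`; the helper name `attention_mask` is hypothetical.

```python
PAD_TOKEN_ID = 1  # default pad_token_id in RobertaSeriesConfig above


def attention_mask(input_ids, pad_token_id=PAD_TOKEN_ID):
    """Return 1 for real tokens and 0 for padding.

    Mirrors (input_ids != pad_token_id).to(torch.int64) in forward(),
    but on nested Python lists instead of tensors.
    """
    return [[int(tok != pad_token_id) for tok in row] for row in input_ids]


# Token ids below are invented for illustration; only the pad id matters.
batch = [
    [0, 581, 6, 2, 1, 1],
    [0, 4, 2, 1, 1, 1],
]
mask = attention_mask(batch)
print(mask)
```

Because the mask depends on the tokenizer's pad id, `set_tokenizer` has to be called before the first forward pass, as the example above does.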

You can adjust the settings by changing the parameters of the predict_generate_images function. The parameters are described below.

| Parameter | Type | Description |
| --- | --- | --- |
| prompt | str | The prompt text |
| out_path | str | Output path where images are saved |
| n_samples | int | Number of images to generate |
| skip_grid | bool | If true, the image gridding step (combining all images into one grid image) is skipped |
| ddim_step | int | Number of steps in the DDIM sampler |
| plms | bool | If true, the PLMS sampler is used instead of the DDIM sampler |
| scale | float | How strongly the prompt influences the generated image; larger values give the prompt more influence |
| H | int | Image height |
| W | int | Image width |
| C | int | Number of channels of the generated images |
| seed | int | Random seed |
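To make the parameter list concrete, here is a minimal sketch that collects the parameters above in a dataclass and validates them before a run. The class `AltDiffusionParams` is hypothetical and not part of any library, and all default values are illustrative assumptions rather than values confirmed by the source. The `latent_shape` helper additionally assumes that `C` counts latent channels and that the autoencoder downsamples by a factor of 8, as in the original Stable Diffusion scripts.

```python
from dataclasses import dataclass


@dataclass
class AltDiffusionParams:
    """Hypothetical container for the parameters listed in the table above.

    Defaults are illustrative assumptions, not values taken from the source.
    """
    prompt: str
    out_path: str = "./outputs"
    n_samples: int = 1
    skip_grid: bool = False
    ddim_step: int = 50
    plms: bool = False
    scale: float = 7.5
    H: int = 512
    W: int = 512
    C: int = 4
    seed: int = 0

    def validate(self):
        """Raise ValueError on obviously invalid settings."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.n_samples < 1 or self.ddim_step < 1:
            raise ValueError("n_samples and ddim_step must be positive")
        if self.H % 8 or self.W % 8:
            raise ValueError("H and W should be multiples of 8")

    def sampler_name(self):
        """PLMS replaces DDIM when plms is true, as described in the table."""
        return "PLMS" if self.plms else "DDIM"

    def latent_shape(self, downsample=8):
        """(C, H/f, W/f), assuming a downsampling factor f of 8."""
        return (self.C, self.H // downsample, self.W // downsample)


params = AltDiffusionParams(prompt="a puppy wearing a hat")
params.validate()
print(params.sampler_name(), params.latent_shape())
```

Keeping the settings in one object makes it easy to log or reuse them, for example by passing `**dataclasses.asdict(params)` to a function that accepts the same keyword names.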

⚠️ Important note

Model inference requires a GPU with at least 10 GB of memory.
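A quick pre-flight check can catch an undersized GPU before the weights are downloaded. The sketch below is one way to do it, assuming that "10 GB" means 10 × 10^9 bytes (the more lenient reading) and that PyTorch is installed; the helper names are hypothetical. `torch` is imported lazily so the threshold logic can be exercised on a machine without it.

```python
REQUIRED_GB = 10  # from the note above; decimal gigabytes is an assumption


def has_enough_memory(total_bytes, required_gb=REQUIRED_GB):
    """True if the given byte count meets the stated GPU memory requirement."""
    return total_bytes >= required_gb * 10**9


def check_gpu(device_index=0):
    """Return (ok, message) for a CUDA device. Requires PyTorch."""
    import torch  # imported lazily so has_enough_memory works without torch

    if not torch.cuda.is_available():
        return False, "no CUDA device available"
    props = torch.cuda.get_device_properties(device_index)
    ok = has_enough_memory(props.total_memory)
    return ok, f"{props.name}: {props.total_memory / 10**9:.1f} GB total"
```

Calling `check_gpu()` before loading the pipeline returns a flag and a short description of the device. Treat the result as a rough guide, since actual usage depends on precision, resolution, and batch size.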

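📚 Documentation

Model information

We used AltCLIP-m9 and trained a multilingual diffusion model based on Stable Diffusion. The training data came from the WuDao dataset and [LAION](https://huggingface.co/datasets/laion/laion2B-en).

Our model performs very well at multilingual alignment, and the authors describe it as the strongest open-source multilingual version currently available. It retains most of the capabilities of the original Stable Diffusion and, in some cases, outperforms the original model.

AltDiffusion-m9 is backed by AltCLIP-m9, a multilingual CLIP model that is also accessible through FlagAI. See the accompanying tutorial for details.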

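Model weights

The first time you run the AltDiffusion-m9 model, the following weights are downloaded automatically from Hugging Face.

| Model name | Size | Description |
| --- | --- | --- |
| StableDiffusionSafetyChecker | 1.13G | Safety checker for images |
| AltDiffusion-m9 | 8.0G | Supports English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It) |
| AltCLIP-m9 | 3.22G | Supports English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It) |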
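Since the first run triggers a sizeable download, it may be worth checking free disk space beforehand. The following sketch sums the sizes from the table above and compares the total with the free space at a given path. It assumes the three entries are downloaded separately and that "G" means decimal gigabytes; if the sizes overlap, the total is an overestimate. The helper names are hypothetical.

```python
import shutil

# Sizes in gigabytes, copied from the table above.
WEIGHT_SIZES_GB = {
    "StableDiffusionSafetyChecker": 1.13,
    "AltDiffusion-m9": 8.0,
    "AltCLIP-m9": 3.22,
}


def total_download_gb(sizes=WEIGHT_SIZES_GB):
    """Sum of the listed weight sizes, rounded to two decimals."""
    return round(sum(sizes.values()), 2)


def has_free_space(path=".", needed_gb=None):
    """True if the filesystem holding `path` has room for the download."""
    if needed_gb is None:
        needed_gb = total_download_gb()
    free_gb = shutil.disk_usage(path).free / 10**9
    return free_gb >= needed_gb


print(f"Estimated download: {total_download_gb()} GB")
```

Point `path` at the directory that holds your Hugging Face cache if it lives on a different filesystem than the working directory.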

Model summary

| Name | Task | Language(s) | Model | Github |
| --- | --- | --- | --- | --- |
| AltDiffusion-m9 | Multimodal | Multilingual | Stable Diffusion | FlagAI |

Gradio

We support a [Gradio](https://github.com/gradio-app/gradio) Web UI for running AltDiffusion-m9: [Open In Spaces](https://huggingface.co/spaces/akhaliq/AltDiffusion-m9)

Example results

Multilingual examples

Entering the same prompt in different languages produces different faces!
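One way to reproduce this kind of comparison is to hold the random seed fixed and vary only the prompt language, so that any difference in the output comes from the text encoding. The sketch below follows the Diffusers example earlier in this card. The helper names, the seed value, and the output file naming are hypothetical choices, and `generate_all` needs a CUDA GPU plus the downloaded weights to run.

```python
# Prompts copied from this model card; add other supported languages as needed.
PROMPTS = {
    "en": (
        "dark elf princess, highly detailed, d & d, fantasy, highly detailed, "
        "digital painting, trending on artstation, concept art, sharp focus, "
        "illustration, art by artgerm and greg rutkowski and fuji choko and "
        "viktoria gavrilenko and hoang lap"
    ),
    "zh": "黑暗精灵公主，非常详细，幻想，非常详细，数字绘画，概念艺术，敏锐的焦点，插图",
}


def output_path(lang, seed, directory="."):
    """File name that records both the language code and the seed."""
    return f"{directory}/alt_{lang}_seed{seed}.png"


def generate_all(prompts=PROMPTS, seed=12345, steps=25):
    """Generate one image per language from the same seed (needs a CUDA GPU)."""
    import torch
    from diffusers import AltDiffusionPipeline

    pipe = AltDiffusionPipeline.from_pretrained(
        "BAAI/AltDiffusion-m9", torch_dtype=torch.float16, revision="fp16"
    ).to("cuda")

    paths = []
    for lang, prompt in prompts.items():
        # Re-create the generator so every language starts from the same noise.
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
        path = output_path(lang, seed)
        image.save(path)
        paths.append(path)
    return paths
```

Calling `generate_all()` writes one PNG per language to the current directory, which makes side-by-side comparison straightforward.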

Chinese-English alignment

Prompt: dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap

Result generated from the English prompt

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/en_dark_elf.png)

Prompt: 黑暗精灵公主，非常详细，幻想，非常详细，数字绘画，概念艺术，敏锐的焦点，插图

Result generated from the Chinese prompt

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/cn_dark_elf.png)

Chinese performance

Prompt: 带墨镜的男孩肖像，充满细节，8K高清 (portrait of a boy wearing sunglasses, full of detail, 8K HD)

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/boy.png)

Prompt: 带墨镜的中国男孩肖像，充满细节，8K高清 (portrait of a Chinese boy wearing sunglasses, full of detail, 8K HD)

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/cn_boy.png)

Long-image generation

Prompt: 一只带着帽子的小狗 (a puppy wearing a hat)

Original Stable Diffusion

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/dog_other.png)

Our model

![image](https://raw.githubusercontent.com/BAAI-OpenPlatform/test_open/main/dog.png)

Note: the long-image generation technique used here is provided by Right Brain Technology.

Model parameter counts

| Module name | Number of parameters |
| --- | --- |
| AutoEncoder | 83.7M |
| Unet | 865M |
| AltCLIP-m9 TextEncoder | 859M |
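These counts give a rough lower bound on the memory needed just to hold the weights. The sketch below adds them up and converts the total to gigabytes at 2 bytes per parameter (fp16) and 4 bytes per parameter (fp32). It is back-of-the-envelope arithmetic only: activations, the safety checker, and framework overhead are not included, and the helper names are hypothetical.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {
    "AutoEncoder": 83.7,
    "Unet": 865.0,
    "AltCLIP-m9 TextEncoder": 859.0,
}


def total_params_millions(counts=PARAMS_MILLIONS):
    """Total number of parameters, in millions."""
    return round(sum(counts.values()), 1)


def weight_memory_gb(bytes_per_param, counts=PARAMS_MILLIONS):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params_millions(counts) * 1e6 * bytes_per_param / 1e9


print(f"total parameters: {total_params_millions()}M")
print(f"fp16 weights: {weight_memory_gb(2):.2f} GB")
print(f"fp32 weights: {weight_memory_gb(4):.2f} GB")
```

The total comes to about 1.8 billion parameters, or roughly 3.6 GB of weights in fp16 and 7.2 GB in fp32.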

🔧 Technical details

Citation

We have released a report on AltCLIP-m9; see the entry below for details. If it is useful for your research, please consider citing it.

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

If you find this paper helpful, please cite it:

@misc{ye2023altdiffusion,
      title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model},
      author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
      year={2023},
      eprint={2308.09991},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

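📄 License

This model is licensed under the [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license). The authors claim no rights over the outputs you generate. You are free to use them, but you are accountable for that use, which must not violate the provisions of the license. The license prohibits you from sharing any content that violates any law, causes harm to a person, disseminates personal information intended to cause harm, spreads misinformation, or targets vulnerable groups. You may modify the model and use it for commercial purposes, but you must include a copy of the same use restrictions. For the full list of restrictions, please [read the license](https://huggingface.co/spaces/CompVis/stable-diffusion-license).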