🚀 AltDiffusion
AltDiffusion is a bilingual multimodal model based on Stable Diffusion. It supports both Chinese and English and is suited to text-to-image generation.
🚀 Quick Start
This README provides a comprehensive guide to AltDiffusion, including model information and usage examples. Follow the steps below to get started quickly.
✨ Features
- Bilingual Support: Understands prompts in both Chinese and English, so users from either language background can use it directly.
- Multimodal Capability: Handles text-to-image generation, producing high-quality images from text descriptions.
- Online Demo: An online demo lets you try the model directly in a web interface.
📦 Installation
When you run the AltDiffusion model for the first time, the following weights are downloaded automatically:
| Model name | Size | Description |
|---|---|---|
| StableDiffusionSafetyChecker | 1.13 GB | Safety checker for generated images |
| AltDiffusion | 8.0 GB | Our bilingual AltDiffusion model |
| AltCLIP | 3.22 GB | Our bilingual AltCLIP model |
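If you prefer to fetch the weights ahead of time instead of relying on the automatic download, a minimal sketch using `huggingface_hub` (the repo id `BAAI/AltDiffusion` matches the checkpoint used in the examples below; the cache location follows the Hugging Face default):

```python
# Optional: pre-download the AltDiffusion weights.
# Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

# Downloads the full repository snapshot into the local Hugging Face cache
# (~/.cache/huggingface by default) and returns the local path.
local_dir = snapshot_download("BAAI/AltDiffusion")
print(local_dir)
```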
💻 Usage Examples
🧨Diffusers Example
AltDiffusion has been added to 🧨Diffusers!
You can run our Diffusers example in Colab here.
The documentation page is available here.
The following example uses the fast DPM scheduler to generate an image in about 2 seconds on a V100.
First, install the diffusers main branch and some dependencies:
pip install git+https://github.com/huggingface/diffusers.git torch transformers accelerate sentencepiece
Then run the following example:
from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load the fp16 weights of the bilingual AltDiffusion pipeline.
pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

# Swap in the fast DPM-Solver multistep scheduler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# The prompt can be Chinese or English.
prompt = "黑暗精灵公主,非常详细,幻想,非常详细,数字绘画,概念艺术,敏锐的焦点,插图"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"

# 25 DPM-Solver steps are enough for a good sample.
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")
Transformers Example
import torch
import torch.nn as nn
from typing import Optional

from transformers import BertPreTrainedModel, XLMRobertaModel
from transformers.models.xlm_roberta.tokenization_xlm_roberta import XLMRobertaTokenizer
from transformers.models.xlm_roberta.configuration_xlm_roberta import XLMRobertaConfig
from diffusers import StableDiffusionPipeline
class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, project_dim=768, pooler_fn='cls', learn_encoder=False, **kwargs):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
        # Dimension of the linear projection applied on top of the encoder.
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
class RobertaSeriesModelWithTransformation(BertPreTrainedModel):
    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig

    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config)
        # Project the encoder's hidden states to the CLIP text-embedding dimension.
        self.transformation = nn.Linear(config.hidden_size, config.project_dim)
        self.post_init()

    def set_tokenizer(self, tokenizer):
        # Keep a reference to the tokenizer so forward() can build the attention mask.
        self.tokenizer = tokenizer

    def forward(self, input_ids: Optional[torch.Tensor] = None):
        # Mask out padding tokens.
        attention_mask = (input_ids != self.tokenizer.pad_token_id).to(torch.int64)
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        projection_state = self.transformation(outputs.last_hidden_state)
        return (projection_state,)
# Checkpoint names on the Hugging Face Hub.
model_path_encoder = "BAAI/RobertaSeriesModelWithTransformation"
model_path_diffusion = "BAAI/AltDiffusion"
device = "cuda"
seed = 12345
torch.manual_seed(seed)

tokenizer = XLMRobertaTokenizer.from_pretrained(model_path_encoder, use_auth_token=True)
tokenizer.model_max_length = 77  # match CLIP's 77-token context length

text_encoder = RobertaSeriesModelWithTransformation.from_pretrained(model_path_encoder, use_auth_token=True)
text_encoder.set_tokenizer(tokenizer)
print("text encoder loaded")

# Plug the bilingual text encoder into a standard Stable Diffusion pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    model_path_diffusion,
    tokenizer=tokenizer,
    text_encoder=text_encoder,
    use_auth_token=True,
)
print("diffusion pipeline loaded")
pipe = pipe.to(device)

prompt = "Thirty years old lee evans as a sad 19th century postman. detailed, soft focus, candle light, interesting lights, realistic, oil canvas, character concept art by munkácsy mihály, csók istván, john everett millais, henry meynell rheam, and da vinci"
with torch.no_grad():
    image = pipe(prompt, guidance_scale=7.5).images[0]
image.save("3.png")
More parameters of FlagAI's `predict_generate_images` API are listed below for you to adjust (a usage sketch follows the table):

| Parameter | Type | Description |
|---|---|---|
| prompt | str | The prompt text |
| out_path | str | The output path for saving images |
| n_samples | int | Number of images to generate |
| skip_grid | bool | If set to true, the image-gridding step is skipped |
| ddim_step | int | Number of DDIM sampling steps |
| plms | bool | If set to true, the PLMS sampler is used instead of the DDIM sampler |
| scale | float | Guidance scale: how strongly the prompt influences the generated images |
| H | int | Height of the generated images |
| W | int | Width of the generated images |
| C | int | Number of channels of the generated images |
| seed | int | Random seed |
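As a hedged sketch of how these parameters are passed through FlagAI: the `AutoLoader`/`Predictor` pattern below follows FlagAI's conventions, but the exact `task_name` and `model_name` strings may differ across FlagAI versions, so treat them as assumptions.

```python
# Hedged sketch: loading AltDiffusion through FlagAI and calling
# predict_generate_images with parameters from the table above.
# task_name/model_name are assumptions; check your FlagAI version.
import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

loader = AutoLoader(task_name="text2img", model_name="AltDiffusion")
model = loader.get_model()
model.eval()
model.to("cuda" if torch.cuda.is_available() else "cpu")

predictor = Predictor(model)
predictor.predict_generate_images(
    prompt="一只带着帽子的小狗",  # "a puppy wearing a hat"
    n_samples=4,           # number of images to generate
    ddim_step=50,          # DDIM sampling steps
    scale=7.5,             # guidance scale
    H=512, W=512,          # output resolution
    seed=12345,            # random seed
)
```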
⚠️ Important Note
Model inference requires a GPU with at least 10 GB of memory.
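If your GPU is close to that limit, diffusers offers standard memory reducers that you can apply to the pipeline from the examples above (these are generic diffusers calls, not AltDiffusion-specific):

```python
# Standard diffusers memory savers for GPUs near the 10 GB limit.
pipe.enable_attention_slicing()        # compute attention in slices
# pipe.enable_sequential_cpu_offload() # offload submodules to CPU (needs `accelerate`)
```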
📚 Documentation
Model Information
We used AltCLIP as the text encoder and trained a bilingual diffusion model on top of Stable Diffusion, with training data from the WuDao dataset and LAION.
The model aligns Chinese and English well, retains most of the original Stable Diffusion's capabilities, and in some cases generates better results than the original model; at release it was the strongest open-source bilingual version available.
The AltDiffusion model is backed by a bilingual CLIP model named AltCLIP, which is also available in FlagAI. You can read this tutorial for more information.
AltDiffusion now has an online demo; try it out by clicking here!
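AltCLIP also ships in 🤗Transformers (as `AltCLIPModel`/`AltCLIPProcessor`, available since v4.26). A minimal sketch, assuming the `BAAI/AltCLIP` checkpoint and the `alt.png` image generated earlier:

```python
# Minimal sketch: score a bilingual text/image pair with AltCLIP.
from transformers import AltCLIPModel, AltCLIPProcessor
from PIL import Image

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

image = Image.open("alt.png")  # image generated earlier
inputs = processor(text=["一只小狗", "a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # text-image similarity
```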
Model Details
| Name | Task | Language(s) | Model | GitHub |
|---|---|---|---|---|
| AltDiffusion | Multimodal | Chinese & English | Stable Diffusion | FlagAI |
Gradio
We provide a Gradio Web UI for running AltDiffusion:
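A minimal sketch of such a UI, wrapping the diffusers pipeline from the Quick Start above (the layout of the official demo may differ):

```python
# Minimal Gradio wrapper around the AltDiffusion diffusers pipeline.
import gradio as gr
import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion", torch_dtype=torch.float16).to("cuda")

def generate(prompt: str):
    # Chinese or English prompts both work.
    return pipe(prompt, num_inference_steps=25).images[0]

demo = gr.Interface(fn=generate, inputs=gr.Textbox(label="Prompt"), outputs=gr.Image(label="Result"))
demo.launch()
```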
More Results
Chinese and English alignment ability
- English prompt: dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap
- Chinese prompt: 黑暗精灵公主,非常详细,幻想,非常详细,数字绘画,概念艺术,敏锐的焦点,插图
Performance on Chinese prompts
- Prompt: 带墨镜的男孩肖像,充满细节,8K高清 (portrait of a boy wearing sunglasses, full of detail, 8K HD)
- Prompt: 带墨镜的中国男孩肖像,充满细节,8K高清 (portrait of a Chinese boy wearing sunglasses, full of detail, 8K HD)
The ability to generate long images
- Prompt: 一只带着帽子的小狗 (a puppy wearing a hat)
- Original stable diffusion:
- Ours:
💡 Usage Tip
The long image generation technology here is provided by Right Brain Technology.
📄 License
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content
- The authors claim no rights over the outputs you generate; you are free to use them, and you are accountable for their use, which must not go against the provisions set in the license
- You may redistribute the weights and use the model commercially and/or as a service. If you do, be aware that you must include the same use restrictions as those in the license and share a copy of the CreativeML OpenRAIL-M with all your users. Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
📖 Citation
If you find this work helpful, please consider citing:
@article{https://doi.org/10.48550/arxiv.2211.06679,
doi = {10.48550/ARXIV.2211.06679},
url = {https://arxiv.org/abs/2211.06679},
author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}