🚀 Mitsua Likes: A Text-to-Image Diffusion Model Trained on Opt-In Contributors' "Likes"
Mitsua Likes is a text-to-image latent diffusion model that supports both Japanese and English. It is trained only on data with explicit opt-in permission, openly licensed data, and public domain data, ensuring compliance with licensing and ethical standards.
🚀 Quick Start
Installation
- Install the required Python packages:

```bash
pip install transformers sentencepiece diffusers
```

Verified on the following versions:

```
transformers==4.44.2
diffusers==0.31.0
sentencepiece==0.2.0
```
- Run the pipeline:

```python
from diffusers import DiffusionPipeline
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = DiffusionPipeline.from_pretrained("Mitsua/mitsua-likes", trust_remote_code=True).to(device, dtype=dtype)

# Prompt: "Elan Mitsua in a waterfall, sensei art" (Japanese and English prompts are supported)
prompt = "滝の中の絵藍ミツア、先生アート"
negative_prompt = "elan doodle, lowres"

ret = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=5.0,
    guidance_rescale=0.7,
    width=768,
    height=768,
    num_inference_steps=40,
)

# The pipeline also reports whether the output resembles a licensed fictional character.
print("Similarity Restriction:", ret.detected_public_fictional_characters[0])
print("Similarity Measure:")
for k, v in ret.detected_public_fictional_characters_info[0].items():
    print(f"{k} : {v:.3%}")

image = ret.images[0]
image.save("output.png")
```
✨ Features
- Ethical Training: Mitsua Likes is trained solely on opt-in or openly licensed data and public domain data, without using data generated by other AI models. It is Fairly Trained certified, indicating it does not train on copyrighted works without a license.
- Independent Architecture: The entire model (CLIP text encoder, VAE, UNet) is trained from scratch, without relying on the knowledge of any pre-trained model.
- Specific Domain Generation: It struggles with most modern concepts and complex prompts, but excels at generating specific types of images, such as simple anime-style portraits and landscapes.
📚 Documentation
Model Details
Model Architecture
CLIP Text Encoder
- 12-layer masked text transformer
- Tokenizer: sentencepiece tokenizer with a 64k vocabulary
- Max length: 64 tokens
- This text encoder comes from Mitsua Japanese CLIP (see the tokenization sketch below)
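As a rough illustration, prompts are tokenized by the sentencepiece tokenizer and padded or truncated to the encoder's 64-token context window. The sketch below assumes the tokenizer can be loaded from a `tokenizer` subfolder of the model repo, as diffusers pipelines conventionally lay out; that subfolder name is an assumption, not a confirmed detail.

```python
from transformers import AutoTokenizer

# Assumption: the Mitsua Likes repo exposes its sentencepiece tokenizer
# in a "tokenizer" subfolder (the usual diffusers convention).
tokenizer = AutoTokenizer.from_pretrained(
    "Mitsua/mitsua-likes", subfolder="tokenizer", trust_remote_code=True
)

# Pad/truncate the prompt to the encoder's 64-token context window.
enc = tokenizer(
    "滝の中の絵藍ミツア、先生アート",
    max_length=64,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(enc.input_ids.shape)  # -> torch.Size([1, 64])
```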
VAE
- The VAE is trained with a fully formula-based Wavelet Loss, so it does not depend on ImageNet-derived models in any way (a minimal sketch of such a loss follows this list).
- The VAE decoder is finetuned to embed an invisible watermark in generated images, based on our own implementation referencing The Stable Signature.
- Number of latent channels: 8
- Note: this repo's VAE encoder weights are re-initialized to prevent misuse such as unauthorized finetuning. If you need the VAE encoder weights, please apply via My Mitsua Likes Waitlist Registration.
- Total training steps: 280k with batch size 240 at resolution 256x256, taking about 800 RTX 4090 GPU hours.
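The exact loss formulation is not published here; what follows is a minimal sketch of what a purely formula-based wavelet reconstruction loss can look like, using a single-level 2D Haar decomposition and an L1 penalty on the coefficients. The single-level design, function names, and L1 choice are illustrative assumptions, but note that nothing in it requires a pretrained network.

```python
import torch
import torch.nn.functional as F

def haar_decompose(x: torch.Tensor) -> torch.Tensor:
    """Single-level 2D Haar transform of a (B, C, H, W) image.

    Returns the four subbands (LL, LH, HL, HH) stacked on the channel axis.
    """
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

def wavelet_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between Haar coefficients; no pretrained model involved."""
    return F.l1_loss(haar_decompose(recon), haar_decompose(target))

# Example: loss between a VAE reconstruction and its target batch.
recon = torch.rand(2, 3, 256, 256)
target = torch.rand(2, 3, 256, 256)
print(wavelet_loss(recon, target).item())
```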
UNet
- The UNet architecture references SDXL's UNet but reduces the parameter count to fit the relatively small training data.
- Training procedure: progressive resolution training and aspect bucket training.
- To speed up training, the Min-SNR loss weighting and Immiscible Diffusion techniques are applied (see the sketch after this list).
- Total training steps: 550k with batch sizes of 216 to 1920 depending on resolution.
- All training was done on a single 8xH100 node; UNet training took about 2,000 H100 GPU hours in total.
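For reference, Min-SNR weighting (Hang et al., 2023) clips each timestep's signal-to-noise ratio at a constant γ so that easy, high-SNR timesteps do not dominate the loss. Below is a minimal sketch for an ε-prediction objective, assuming a standard diffusers-style scheduler exposing `alphas_cumprod`; γ=5 follows the paper's default and is not a confirmed Mitsua Likes setting.

```python
import torch

def min_snr_weights(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor,
                    gamma: float = 5.0) -> torch.Tensor:
    """Per-sample loss weights min(SNR, gamma) / SNR for epsilon-prediction."""
    alpha_bar = alphas_cumprod[timesteps]   # (B,) cumulative alphas at sampled t
    snr = alpha_bar / (1.0 - alpha_bar)     # SNR(t) = alpha_bar / (1 - alpha_bar)
    return torch.minimum(snr, torch.full_like(snr, gamma)) / snr

# Usage inside a training step (schematic):
# per_sample = F.mse_loss(eps_pred, eps, reduction="none").mean(dim=(1, 2, 3))
# loss = (min_snr_weights(scheduler.alphas_cumprod, t) * per_sample).mean()
```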
Character Similarity Determination Model
- This model is a Swin Transformer multi-label classification model finetuned from Swin Base Multi Fractal 1k.
- Its training data is a subset of the Mitsua Japanese CLIP training data. It is an additional post-processing classifier that checks generated images for similarity to licensed fictional characters (a thresholding sketch follows).
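Conceptually, a multi-label classifier like this emits one independent probability per character via a sigmoid, and a cutoff decides whether a similarity restriction is reported. The sketch below illustrates only that post-processing step; the logits, label names, and 0.5 threshold are invented for illustration and are not the model's published values.

```python
import torch

# Hypothetical logits from the Swin classifier for one generated image,
# one logit per licensed character label (placeholder names).
logits = torch.tensor([2.1, -3.0, 0.4])
labels = ["character_a", "character_b", "character_c"]

probs = torch.sigmoid(logits)  # independent per-label probabilities
threshold = 0.5                # assumed cutoff; the real value is not published

detected = {l: p.item() for l, p in zip(labels, probs) if p >= threshold}
for name, p in detected.items():
    print(f"{name} : {p:.3%}")  # mirrors the pipeline's similarity report
```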
Intended Use
- Generation of artworks for further creative endeavors
- Research or education on generative models
Out-of-Scope Use
Infringing others' rights of any kind (copyright, publicity rights, privacy, etc.) or causing harm to others is a misuse of this model. This includes, but is not limited to:
- Discriminating against, defaming, or insulting others, thereby damaging their honor or credibility.
- Infringing or potentially infringing the intellectual property rights or privacy of others.
- Disseminating information or content that unjustly harms the interests of others.
- Disseminating false information or content.
Please read the "Prohibitions" section of the Mitsua Likes BY-NC license for more details.
Trainable model waitlist
The VAE encoder weights in this repository are re-initialized to prevent misuse. Finetuning on images and image-to-image generation are therefore technically disabled, and they are also prohibited by the license terms.
For non-commercial research or personal creative purposes, you can register for the waitlist to receive full model access, including the VAE encoder weights.
The training data must be owned by you or explicitly licensed, and a summary of the training data will be publicly disclosed.
The other conditions are described in the following Google Form.
My Mitsua Likes Waitlist Registration
📄 License
The model is licensed under the Mitsua Likes Attribution-NonCommercial License. "Mitsua Likes" attribution is required when sharing generated results. Commercial use is limited to your own personal creative purposes, and using this model for machine learning is prohibited. For corporate commercial use, please reach out via the contact form.