🚀 Kohaku XL Zeta
Kohaku XL Zeta is a text-to-image model. It offers enhanced stability, better style and character fidelity, and improved natural-language captioning ability. Trained on a large and diverse dataset, it has an extended context length and provides high-quality image generation.
Join us: https://discord.gg/tPBsKDyRR5

✨ Features
- Model Inheritance: Resumes from Kohaku-XL-Epsilon rev2.
- Stability Improvement: More stable, no longer requiring long/detailed prompts.
- High-Fidelity Output: Better fidelity on style and character, supporting more styles. The model surpasses Sanae XL anime on the CCIP metric, with over 2200 characters scoring CCIP > 0.9 in a 3700-character test set.
- Captioning Ability: Trained on both danbooru tags and natural language, offering better performance on natural language captioning.
- Diverse Training Data: Trained on a combined dataset, not just danbooru: danbooru (7.6M images), pixiv (filtered from a 2.6M-image special set), PVC figures (around 30k images), and realbooru (around 90k images for regularization), totaling 8.46M images.
- Extended Context Length: Since the model is trained on both types of captions, the context length limit is extended to 300 tokens.

💻 Usage Examples
Recommended Generation Settings
- Resolution: 1024x1024 or similar pixel count.
- CFG Scale: 3.5-6.5.
- Sampler/Scheduler:
  - Euler (A) with any scheduler.
  - DPM++ series with the exponential scheduler.
  - For other samplers, an exponential scheduler is recommended.
- Steps: 12-50.
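The settings above can be sketched as a small generation script using Hugging Face diffusers. This is a minimal sketch, not an official example: the repo id `KBlueLeaf/Kohaku-XL-Zeta` and the default negative prompt are assumptions, not taken from this card.

```python
# Recommended settings from this card, collected as a config dict.
GEN_SETTINGS = dict(
    width=1024,              # ~1024x1024 pixel count
    height=1024,
    guidance_scale=5.0,      # CFG scale: 3.5-6.5
    num_inference_steps=24,  # steps: 12-50
)

def generate(prompt: str, negative_prompt: str = "low quality, worst quality"):
    # Heavy imports are kept inside the function so the settings above can be
    # inspected without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

    # NOTE: the Hugging Face repo id below is an assumption based on the model name.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "KBlueLeaf/Kohaku-XL-Zeta", torch_dtype=torch.float16
    ).to("cuda")
    # Euler Ancestral is one of the recommended samplers.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    return pipe(prompt, negative_prompt=negative_prompt, **GEN_SETTINGS).images[0]
```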
Prompt Format
Same as Kohaku XL Epsilon or Delta, but you can replace "general tags" with "natural language caption". You can also use both together.
Special Tags
- Quality tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality.
- Rating tags: safe, sensitive, nsfw, explicit.
- Date tags: newest, recent, mid, early, old.
Rating tags
- General: safe
- Sensitive: sensitive
- Questionable: nsfw
- Explicit: nsfw, explicit
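The tag lists and the rating mapping above can be combined into a simple prompt-builder sketch. The ordering below (subject tags, then a natural-language caption, then rating/quality/date tags) is an assumption for illustration; only the tag vocabulary and the rating mapping come from this card.

```python
# Danbooru rating -> model rating tags, as listed in this card.
RATING_TO_TAGS = {
    "general": ["safe"],
    "sensitive": ["sensitive"],
    "questionable": ["nsfw"],
    "explicit": ["nsfw", "explicit"],
}

def build_prompt(subject_tags, caption, rating="general",
                 quality="masterpiece, best quality", date="newest"):
    """Join subject tags, a natural-language caption, and special tags.

    The exact part ordering is a hypothetical convention, not specified here.
    """
    parts = subject_tags + [caption] + RATING_TO_TAGS[rating] + [quality, date]
    return ", ".join(p for p in parts if p)

print(build_prompt(["1girl", "solo"], "a girl sitting in a garden, watercolor style"))
# -> 1girl, solo, a girl sitting in a garden, watercolor style, safe, masterpiece, best quality, newest
```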
📚 Documentation
Dataset
To enhance the model's ability on certain concepts, the full danbooru dataset is used instead of a filtered one. A crawled Pixiv dataset (the top 3-5 tags sorted by popularity) is added as an additional dataset. Due to limitations of Pixiv's search system, many of the crawled images are not meaningful, and some duplicate the danbooru set; however, the duplication is ignored in order to reinforce these concepts. As in kxl eps rev2, realbooru and PVC figure images are added for more flexibility in concepts/styles.
Training
- Hardware: Quad RTX 3090s.
| Property | Details |
|----------|---------|
| Dataset - Num Images | 8,468,798 |
| Dataset - Resolution | 1024x1024 |
| Dataset - Min Bucket Resolution | 256 |
| Dataset - Max Bucket Resolution | 4096 |
| Dataset - Caption Tag Dropout | 0.2 |
| Dataset - Caption Group Dropout | 0.2 (for dropping tag/nl caption entirely) |
| Training - Batch Size | 4 |
| Training - Grad Accumulation Step | 32 |
| Training - Equivalent Batch Size | 512 |
| Training - Total Epoch | 1 |
| Training - Total Steps | 16548 |
| Training - Training Time | 430 hours (wall time) |
| Training - Mixed Precision | FP16 |
| Optimizer - Optimizer | Lion8bit |
| Optimizer - Learning Rate | 1e-5 for UNet / TE training disabled |
| Optimizer - LR Scheduler | Constant (with warmup) |
| Optimizer - Warmup Steps | 100 |
| Optimizer - Weight Decay | 0.1 |
| Optimizer - Betas | 0.9, 0.95 |
| Diffusion - Min SNR Gamma | 5 |
| Diffusion - Debiased Estimation Loss | Enabled |
| Diffusion - IP Noise Gamma | 0.05 |
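The Equivalent Batch Size row follows from the per-GPU batch size, the gradient accumulation steps, and the GPU count (quad RTX 3090s); a quick check:

```python
# Effective batch size = per-GPU batch size x grad accumulation x number of GPUs.
batch_size = 4
grad_accum_steps = 32
num_gpus = 4

equivalent_batch_size = batch_size * grad_accum_steps * num_gpus
print(equivalent_batch_size)  # -> 512, matching the table
```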
Why still use SDXL instead of a brand-new DiT-based model?
Unless someone provides reasonable compute resources, or a team releases an efficient enough DiT architecture, no DiT-based anime base model will be trained. However, given 8x H100 GPUs for a year, multiple DiT models could be trained from scratch.
📄 License
Fair-AI-Public-1.0-SD. You can find more details here.