🚀 Kohaku XL Zeta
Kohaku XL Zeta is a text-to-image model. It offers enhanced stability, better style and character fidelity, and improved natural-language captioning ability. Trained on a large and diverse dataset, it has an extended context length and provides high-quality image generation.
Join us: https://discord.gg/tPBsKDyRR5

✨ Features
- Model Inheritance: Resumes from Kohaku-XL-Epsilon rev2.
- Stability Improvement: More stable, no longer requiring long/detailed prompts.
- High-Fidelity Output: Better fidelity on style and character, supporting more styles. The model surpasses Sanae XL anime on the CCIP metric, with over 2200 characters scoring CCIP > 0.9 in a 3700-character test set.
- Captioning Ability: Trained on both danbooru tags and natural language, offering better performance on natural language captioning.
- Diverse Training Data: Trained on a combined dataset, not just danbooru: danbooru (7.6M images), pixiv (filtered from a 2.6M-image special set), PVC figures (around 30k images), and realbooru (around 90k images for regularization), totaling 8.46M images.
- Extended Context Length: Since the model is trained on both types of captions, the context length limit is extended to 300 tokens.

💻 Usage Examples
Recommended Generation Settings
- Resolution: 1024x1024 or similar pixel count.
- CFG Scale: 3.5-6.5.
- Sampler/Scheduler:
  - Euler (A) with any scheduler.
  - DPM++ series with the exponential scheduler.
  - For other samplers, an exponential scheduler is recommended.
- Steps: 12-50.
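The settings above can be sketched as a small generation script using Hugging Face diffusers. This is a minimal sketch, not an official example: the repo id `KBlueLeaf/Kohaku-XL-Zeta` and the default negative prompt are assumptions, not taken from this card.

```python
# Recommended settings from this card, collected as a config dict.
GEN_SETTINGS = dict(
    width=1024,              # ~1024x1024 pixel count
    height=1024,
    guidance_scale=5.0,      # CFG scale: 3.5-6.5
    num_inference_steps=24,  # steps: 12-50
)

def generate(prompt: str, negative_prompt: str = "low quality, worst quality"):
    # Heavy imports are kept inside the function so the settings above can be
    # inspected without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

    # NOTE: the Hugging Face repo id below is an assumption based on the model name.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "KBlueLeaf/Kohaku-XL-Zeta", torch_dtype=torch.float16
    ).to("cuda")
    # Euler Ancestral is one of the recommended samplers.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    return pipe(prompt, negative_prompt=negative_prompt, **GEN_SETTINGS).images[0]
```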
Prompt Format
Same as Kohaku XL Epsilon or Delta, but you can replace "general tags" with "natural language caption". You can also use both together.
Special Tags
- Quality tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality.
- Rating tags: safe, sensitive, nsfw, explicit.
- Date tags: newest, recent, mid, early, old.
Rating tags
- General: safe
- Sensitive: sensitive
- Questionable: nsfw
- Explicit: nsfw, explicit
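The tag lists and the rating mapping above can be combined into a simple prompt-builder sketch. The ordering below (subject tags, then a natural-language caption, then rating/quality/date tags) is an assumption for illustration; only the tag vocabulary and the rating mapping come from this card.

```python
# Danbooru rating -> model rating tags, as listed in this card.
RATING_TO_TAGS = {
    "general": ["safe"],
    "sensitive": ["sensitive"],
    "questionable": ["nsfw"],
    "explicit": ["nsfw", "explicit"],
}

def build_prompt(subject_tags, caption, rating="general",
                 quality="masterpiece, best quality", date="newest"):
    """Join subject tags, a natural-language caption, and special tags.

    The exact part ordering is a hypothetical convention, not specified here.
    """
    parts = subject_tags + [caption] + RATING_TO_TAGS[rating] + [quality, date]
    return ", ".join(p for p in parts if p)

print(build_prompt(["1girl", "solo"], "a girl sitting in a garden, watercolor style"))
# -> 1girl, solo, a girl sitting in a garden, watercolor style, safe, masterpiece, best quality, newest
```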
📚 Documentation
Dataset
To enhance the model's ability on certain concepts, the full danbooru dataset is used instead of a filtered one. A crawled Pixiv dataset (the top 3-5 tags sorted by popularity) is added as an additional dataset. Due to limitations of Pixiv's search system, many of the crawled images are not meaningful, and some duplicate the danbooru set; however, the duplication is ignored in order to reinforce these concepts. As in kxl eps rev2, realbooru and PVC figure images are added for more flexibility in concepts/styles.
Training
- Hardware: Quad RTX 3090s.
| Property | Details |
|----------|---------|
| Dataset - Num Images | 8,468,798 |
| Dataset - Resolution | 1024x1024 |
| Dataset - Min Bucket Resolution | 256 |
| Dataset - Max Bucket Resolution | 4096 |
| Dataset - Caption Tag Dropout | 0.2 |
| Dataset - Caption Group Dropout | 0.2 (for dropping tag/nl caption entirely) |
| Training - Batch Size | 4 |
| Training - Grad Accumulation Step | 32 |
| Training - Equivalent Batch Size | 512 |
| Training - Total Epoch | 1 |
| Training - Total Steps | 16548 |
| Training - Training Time | 430 hours (wall time) |
| Training - Mixed Precision | FP16 |
| Optimizer - Optimizer | Lion8bit |
| Optimizer - Learning Rate | 1e-5 for UNet / TE training disabled |
| Optimizer - LR Scheduler | Constant (with warmup) |
| Optimizer - Warmup Steps | 100 |
| Optimizer - Weight Decay | 0.1 |
| Optimizer - Betas | 0.9, 0.95 |
| Diffusion - Min SNR Gamma | 5 |
| Diffusion - Debiased Estimation Loss | Enabled |
| Diffusion - IP Noise Gamma | 0.05 |
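The Equivalent Batch Size row follows from the per-GPU batch size, the gradient accumulation steps, and the GPU count (quad RTX 3090s); a quick check:

```python
# Effective batch size = per-GPU batch size x grad accumulation x number of GPUs.
batch_size = 4
grad_accum_steps = 32
num_gpus = 4

equivalent_batch_size = batch_size * grad_accum_steps * num_gpus
print(equivalent_batch_size)  # -> 512, matching the table
```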
Why still use SDXL instead of a brand-new DiT-based model?
Unless someone provides reasonable compute resources, or a team releases an efficient enough DiT architecture, no DiT-based anime base model will be trained. However, given 8x H100 GPUs for a year, multiple DiT models could be trained from scratch.
📄 License
Fair-AI-Public-1.0-SD. You can find more details here.