🚀 DiT-Wikiart Model
This model is a DiT (diffusion transformer) for conditional image generation. Trained on the Wikiart dataset, it generates art images conditioned on art genre and art style.
✨ Features
- Genre and Style Awareness: Capable of understanding art genres and styles to generate relevant art images.
- Multiple Model Variants: Offers three different size variants (S, B, L) to meet various needs.
- Trained on Wikiart: Utilizes the Wikiart dataset for training, focusing on art image generation.
📦 Installation
To use the model, install the huggingface_hub library and download modeling_dit_wikiart.py from the "Files and versions" section; that file contains the model definition.
pip install huggingface_hub
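As an alternative to downloading the file by hand, the model definition can probably be fetched with huggingface_hub itself. The sketch below assumes modeling_dit_wikiart.py sits at the root of the kaupane/DiT-Wikiart-Small repository (the repo id used in the usage example); fetch_model_definition is a hypothetical helper name, not part of this model's code.

```python
def fetch_model_definition(
    repo_id: str = "kaupane/DiT-Wikiart-Small",
    filename: str = "modeling_dit_wikiart.py",
    local_dir: str = ".",
) -> str:
    """Download the model definition file and return its local path.

    Assumption: the file is stored at the root of `repo_id`.
    """
    # Imported here so that defining this helper does not itself
    # require huggingface_hub to be installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Calling fetch_model_definition() once from the working directory should place the file next to your script, so that the import in the usage example resolves.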
💻 Usage Examples
Basic Usage
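from modeling_dit_wikiart import DiTWikiartModel
import torch
# Load the pretrained small variant by its Hugging Face repo id
model = DiTWikiartModel.from_pretrained("kaupane/DiT-Wikiart-Small")
# Build a batch of random latents: 4 channels at 32x32 resolution
num_samples = 8
noisy_latents = torch.randn(num_samples, 4, 32, 32)
# Run the model to get its noise prediction for the batch
predicted_noise = model(noisy_latents)
print(predicted_noise)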
📚 Documentation
Model Description
This model is a DiT (diffusion transformer) trained from scratch on the Wikiart dataset (https://huggingface.co/datasets/Artificio/WikiArt). It is designed to generate art images given an art genre and an art style.
Model Architecture
The model largely mirrors the classic DiT architecture described in the paper "Scalable Diffusion Models with Transformers", with slight modifications:
- Replaced ImageNet class embeddings with Wikiart genre and style embeddings;
- Used post-norm instead of pre-norm;
- Omitted the final linear layer;
- Replaced the sin-cos 2D positional embedding with a learned positional embedding;
- Models only predict noise and don't learn sigma;
- Set patch_size = 2 for all model variants;
- Model size settings differ from those of the original DiT variants.
The model has three variants:
- S: small, num_blocks = 8, hidden_size = 384, num_heads = 6, total_params = 20M;
- B: base, num_blocks = 12, hidden_size = 640, num_heads = 10, total_params = 90M;
- L: large, num_blocks = 16, hidden_size = 896, num_heads = 14, total_params = 234M.
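A few quantities follow directly from the settings above and can be checked with a short sketch. It uses the listed hidden_size and num_heads, the patch_size of 2, and the 4x32x32 latent shape from the usage example; DiTVariant, num_tokens, and patch_dim are illustrative names, not identifiers from modeling_dit_wikiart.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTVariant:
    """Size settings of one variant, copied from the list above."""
    name: str
    num_blocks: int
    hidden_size: int
    num_heads: int

    @property
    def head_dim(self) -> int:
        # Per-head width; hidden_size must split evenly across heads.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads


VARIANTS = [
    DiTVariant("S", num_blocks=8, hidden_size=384, num_heads=6),
    DiTVariant("B", num_blocks=12, hidden_size=640, num_heads=10),
    DiTVariant("L", num_blocks=16, hidden_size=896, num_heads=14),
]


def num_tokens(latent_size: int = 32, patch_size: int = 2) -> int:
    """Number of patch tokens for a square latent."""
    return (latent_size // patch_size) ** 2


def patch_dim(channels: int = 4, patch_size: int = 2) -> int:
    """Number of values in one flattened patch."""
    return channels * patch_size * patch_size


for variant in VARIANTS:
    print(variant.name, "head_dim =", variant.head_dim)
print("tokens per image =", num_tokens())
print("values per patch =", patch_dim())
```

Under these settings every variant uses the same per-head width, and each latent is split into a 16x16 grid of patches.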
Training Procedure
- Dataset: All model variants were trained on the Wikiart dataset (about 103K images), augmented with horizontal flipping.
- Optimizer: AdamW with default settings.
- Learning Rate: Linear warmup over the first 1% of steps to a peak learning rate of 3e-4, then cosine decay to zero over the remaining steps.
- Epochs and Batch Size:
  - S: 96 epochs with a batch size of 176.
  - B: 120 epochs with a batch size of 192.
  - L: 144 epochs with a batch size of 192.
- Device:
  - S: single RTX 4060 Ti 16GB for 24 hours.
  - B: single RTX 4060 Ti 16GB for 90 hours.
  - L: single RTX 4090D 24GB for 48 hours, followed by a single RTX 4060 Ti 16GB for 100 hours.
- Loss Curve: All variants showed a sharp drop in loss during the first epoch, from above 1.0000 to around 0.2000, followed by a much slower decrease to about 0.1600 by the 20th epoch. Final losses: DiT-S 0.1590, DiT-B 0.1525, DiT-L 0.1510.
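The learning rate schedule described above can be sketched as a plain function of the step index. This is a minimal illustration, not the training script used for these models: the function name, the step-indexing convention, and the exact rounding of the warmup length are assumptions.

```python
import math


def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-4,
    warmup_fraction: float = 0.01,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    # Assumption: warmup length is 1% of all steps, at least one step.
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the training runs
    for s in (0, 9, 10, 500, 1000):
        print(s, learning_rate(s, total))
```

In PyTorch, a function like this could be wrapped in torch.optim.lr_scheduler.LambdaLR by returning the multiplier learning_rate(step, total_steps) / peak_lr instead of the absolute value.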
Performance and Limitations
- Performance: The models show a basic ability to understand genres and styles and produce paintings that look visually appealing at first glance.
- Limitations:
  - Failure to render complex structures such as human faces and buildings.
  - Occasional mode collapse when asked to generate genres or styles rarely seen in the dataset, e.g., styles like minimalism and genres like uroshi-e.
  - Resolution limited to 256x256.
  - Trained only on the Wikiart dataset, and therefore unable to generate out-of-scope images.
📄 License
This model is licensed under the MIT license.