DiT-Wikiart-Small Open Source Model - Freely Generate Artistic Images with Unique Styles

Dit Wikiart Small

Developed by kaupane

A diffusion transformer model trained on the Wikiart dataset for generating artistic style images

Image Generation

Safetensors

Open Source License: MIT #Art Style Generation #Diffusion Transformer #Wikiart Dataset

Downloads 29

Release Time: 4/9/2025

Model Overview

This model is a DiT (Diffusion Transformer) trained from scratch on the Wikiart dataset, designed to generate artistic images based on art genres and styles.

Model Features

Art Genre and Style Embedding

Replaces ImageNet category embeddings with Wikiart genre and style embeddings, focusing on artistic image generation

Optimized Architecture Design

Incorporates optimizations like post-normalization and learnable positional embeddings to enhance model performance

Multi-Size Configurations

Offers three variants: S (Small), B (Base), and L (Large) to accommodate different needs

Efficient Training

All variants were trained on single consumer-grade GPUs, and their loss decreased throughout training

Model Capabilities

Artistic Image Generation

Style Transfer

Genre-Specific Image Generation

Use Cases

Art Creation

Art Style Exploration

Generate artworks of different genres and styles for creative inspiration

Can produce visually appealing paintings

Digital Art Tool

Serves as a component in digital art creation tools, providing style references

Education

Art Teaching Aid

Demonstrates characteristics of different art genres and styles

🚀 DiT-Wikiart Model

This model is a DiT (diffusion transformer) trained on the Wikiart dataset. It takes no text prompt: it generates art images conditioned on an art genre and an art style.

✨ Features

Genre and Style Awareness: Capable of understanding art genres and styles to generate relevant art images.
Multiple Model Variants: Offers three different size variants (S, B, L) to meet various needs.
Trained on Wikiart: Utilizes the Wikiart dataset for training, focusing on art image generation.

📦 Installation

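To use the model, install the "huggingface_hub" library and download modeling_dit_wikiart.py, which contains the model definition, from the "Files and versions" section of the model repository. Place the file where Python can import it, for example in your working directory.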

pip install huggingface_hub
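
As an alternative to downloading the file by hand, the model definition can also be fetched programmatically. The sketch below uses `hf_hub_download` from `huggingface_hub`; the helper name `fetch_model_definition` and the decision to copy the file into the current directory are illustrative assumptions, not part of the model repository.

```python
import shutil
from pathlib import Path

REPO_ID = "kaupane/DiT-Wikiart-Small"
MODEL_FILE = "modeling_dit_wikiart.py"


def fetch_model_definition(dest_dir: str = ".") -> Path:
    """Download the model definition and copy it into dest_dir.

    Hypothetical helper: the name and the copy step are assumptions
    made for illustration.
    """
    # Imported lazily so the helper can be defined before the package is installed.
    from huggingface_hub import hf_hub_download

    # Downloads into the local Hugging Face cache and returns the cached path.
    cached_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)

    # Copy next to your script so that "import modeling_dit_wikiart" resolves.
    target = Path(dest_dir) / MODEL_FILE
    shutil.copy(cached_path, target)
    return target
```

Calling `fetch_model_definition()` once before importing `DiTWikiartModel` should be enough for the import to resolve from the working directory.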

💻 Usage Examples

Basic Usage

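import torch
from modeling_dit_wikiart import DiTWikiartModel  # needs modeling_dit_wikiart.py on the import path

# Load the pretrained weights from the Hugging Face Hub
model = DiTWikiartModel.from_pretrained("kaupane/DiT-Wikiart-Small")

# A batch of random latents: 4 channels at 32x32
num_samples = 8
noisy_latents = torch.randn(num_samples, 4, 32, 32)

# A single forward pass returns the noise predicted for the input latents
with torch.no_grad():
    predicted_noise = model(noisy_latents)
print(predicted_noise)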

📚 Documentation

Model Description

This model is a DiT (diffusion transformer) trained from scratch on the Wikiart dataset (https://huggingface.co/datasets/Artificio/WikiArt). It is designed to generate art images conditioned on an art genre and an art style.

Model Architecture

The model largely mirrors the classic DiT architecture described in the paper "Scalable Diffusion Models with Transformers", with slight modifications:

Replaced ImageNet class embeddings with Wikiart genre and style embeddings;
Used post-norm instead of pre-norm;
Omitted the final linear layer;
Replaced the sin-cos 2D positional embedding with a learned positional embedding;
Models only predict noise and do not learn sigma;
Set patch_size = 2 for all model variants;
Model sizes differ from the configurations in the original paper.
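
Because the models predict only the noise and do not learn sigma, training can use the plain noise-prediction objective known from DDPM. The following is a minimal sketch of that objective, not the author's training code: the linear beta schedule, the 1000 timesteps, and the `predict_noise` callable (a stand-in for the model together with its conditioning inputs) are all assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed noise schedule: linear betas over 1000 steps (not stated in the model card).
NUM_TIMESTEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_TIMESTEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def noise_prediction_loss(predict_noise, clean_latents):
    """Mean squared error between true and predicted noise at random timesteps.

    `predict_noise(noisy_latents, timesteps)` is a hypothetical callable
    standing in for the model.
    """
    batch_size = clean_latents.shape[0]
    timesteps = torch.randint(0, NUM_TIMESTEPS, (batch_size,))
    noise = torch.randn_like(clean_latents)

    # Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[timesteps].view(batch_size, 1, 1, 1)
    noisy_latents = a_bar.sqrt() * clean_latents + (1.0 - a_bar).sqrt() * noise

    # No sigma term: the loss only compares predicted and true noise.
    return F.mse_loss(predict_noise(noisy_latents, timesteps), noise)
```

With this objective, a predictor that always returns zeros scores a loss of roughly 1.0, since the target is unit-variance Gaussian noise.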

The model has three variants:

S: small, num_blocks = 8, hidden_size = 384, num_heads = 6, total_params = 20M;
B: base, num_blocks = 12, hidden_size = 640, num_heads = 10, total_params = 90M;
L: large, num_blocks = 16, hidden_size = 896, num_heads = 14, total_params = 234M.
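
The figures above can be sanity-checked with a little arithmetic. This sketch derives the token count and per-head width from numbers given on this page (256x256 images, 4x32x32 latents, patch_size = 2) and makes a rough parameter estimate. The estimate assumes each block follows the standard DiT layout (attention, an MLP with ratio 4, and adaLN modulation, about 18 x hidden_size^2 weights per block) and ignores embeddings and biases; that layout is an assumption, not something the model card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    num_blocks: int
    hidden_size: int
    num_heads: int


VARIANTS = [
    Variant("S", 8, 384, 6),
    Variant("B", 12, 640, 10),
    Variant("L", 16, 896, 14),
]

IMAGE_SIZE = 256       # output resolution listed under limitations
LATENT_CHANNELS = 4    # from the usage example: (N, 4, 32, 32)
LATENT_SIZE = 32
PATCH_SIZE = 2

downsample_factor = IMAGE_SIZE // LATENT_SIZE      # image pixels per latent cell, per side
num_tokens = (LATENT_SIZE // PATCH_SIZE) ** 2      # sequence length seen by the transformer
patch_dim = LATENT_CHANNELS * PATCH_SIZE ** 2      # values packed into each token


def estimate_block_params(variant: Variant) -> int:
    # Assumed per block: attention 4h^2 + MLP 8h^2 + adaLN modulation 6h^2 = 18h^2.
    return variant.num_blocks * 18 * variant.hidden_size ** 2


print(f"downsample={downsample_factor}x, tokens={num_tokens}, patch_dim={patch_dim}")
for v in VARIANTS:
    head_dim = v.hidden_size // v.num_heads
    print(f"{v.name}: head_dim={head_dim}, ~{estimate_block_params(v) / 1e6:.1f}M block params")
```

Under these assumptions every variant uses 64-dimensional heads over 256 tokens, and the estimates come out at about 21M, 88M, and 231M, close to the stated totals of 20M, 90M, and 234M.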

Training Procedure

Dataset: All model variants were trained on the Wikiart dataset (about 103K images), augmented with horizontal flipping.
Optimizer: AdamW with default settings.
Learning Rate: Linear warmup over the first 1% of steps to a peak of 3e-4, followed by cosine decay to zero over the remaining steps.
Epochs and Batch Size:
- S: 96 epochs with a batch size of 176.
- B: 120 epochs with a batch size of 192.
- L: 144 epochs with a batch size of 192.
Device:
- S: single RTX 4060 Ti 16GB for 24 hrs.
- B: single RTX 4060 Ti 16GB for 90 hrs.
- L: single RTX 4090D 24GB for 48 hrs, followed by single RTX 4060 Ti 16GB for 100 hrs.
Loss Curve: For all variants, the loss dropped sharply during the first epoch, from above 1.0000 to around 0.2000, then decreased much more slowly to about 0.1600 by the 20th epoch. Final losses were 0.1590 for DiT-S, 0.1525 for DiT-B, and 0.1510 for DiT-L.
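
The learning-rate schedule described above (linear warmup, then cosine decay to zero) can be sketched in a few lines. This is an illustration rather than the author's training code: the function name, the per-step granularity, and the dataset size of exactly 103,000 images used in the DiT-S example are assumptions.

```python
import math

PEAK_LR = 3e-4
WARMUP_FRACTION = 0.01  # first 1% of steps


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * WARMUP_FRACTION))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step lands exactly on the peak.
        return PEAK_LR * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example for DiT-S: 96 epochs at batch size 176.
# The exact image count is an assumption based on the "103K" figure.
steps_per_epoch = math.ceil(103_000 / 176)
total_steps = 96 * steps_per_epoch

for step in (0, total_steps // 100, total_steps // 2, total_steps - 1):
    print(f"step {step:>6}: lr = {learning_rate(step, total_steps):.2e}")
```

Under these assumptions DiT-S trains for roughly 56K optimizer steps, of which the first few hundred are warmup.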

Performance and Limitations

Performance: The models show a basic grasp of genres and styles and produce paintings that are visually appealing at first glance.
Limitations:
- Fails to render complex structures such as human faces and buildings.
- Occasional mode collapse when asked for genres or styles that are rare in the dataset, e.g., the style minimalism or the genre uroshi-e.
- Resolution is limited to 256x256.
- Trained only on the Wikiart dataset, so it cannot generate images outside that scope.

📄 License

This model is licensed under the MIT license.
