# vit-base-nsfw-detector
This model is a fine-tuned version of [vit-base-patch16-384](https://huggingface.co/google/vit-base-patch16-384) on around 25,000 images (drawings, photos...). It can accurately classify images as NSFW or SFW, providing reliable image-classification results.

## Quick Start
This model is a fine-tuned version of [vit-base-patch16-384](https://huggingface.co/google/vit-base-patch16-384) on around 25,000 images (drawings, photos...).
It achieves the following results on the evaluation set:
- Loss: 0.0937
- Accuracy: 0.9654
New [07/30]: I created a new ViT model specifically to detect NSFW/SFW images for stable diffusion usage (read the disclaimer below for the reason): [AdamCodd/vit-nsfw-stable-diffusion](https://huggingface.co/AdamCodd/vit-nsfw-stable-diffusion).
Disclaimer: This model wasn't made with generative images in mind! There is no generated image in the dataset used here, and it performs significantly worse on generative images, which will require another ViT model specifically trained on generative images. Here are the model's actual scores for generative images to give you an idea:
- Loss: 0.3682 (↑ 292.95%)
- Accuracy: 0.8600 (↓ 10.91%)
- F1: 0.8654
- AUC: 0.9376 (↓ 5.75%)
- Precision: 0.8350
- Recall: 0.8980
## Features
- High-accuracy classification: Achieves an accuracy of 0.9654 on the evaluation set.
- Multiple metrics: Provides multiple evaluation metrics such as AUC, loss, etc.
- Fine-tuned model: Based on the pre-trained [vit-base-patch16-384](https://huggingface.co/google/vit-base-patch16-384) model.
## Documentation
### Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384.
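Because this checkpoint runs at 384x384 rather than the 224x224 pretraining resolution, the image processor bundled with the repository handles the resizing. The snippet below is a minimal sketch (not from the model card) for checking that resolution; the exact contents of the printed `size` dict are an assumption.

```python
from transformers import ViTImageProcessor

# Load the preprocessing config shipped with this repository and inspect the
# target resolution it resizes images to (expected to be 384x384).
processor = ViTImageProcessor.from_pretrained("AdamCodd/vit-base-nsfw-detector")
print(processor.size)  # assumed output: {'height': 384, 'width': 384}
```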
### Intended uses & limitations
There are two classes: SFW and NSFW. The model has been trained to be restrictive and therefore classify "sexy" images as NSFW. That is, if the image shows cleavage or too much skin, it will be classified as NSFW. This is normal.
The model has been trained on a variety of images (realistic, 3D, drawings), yet it is not perfect and some images may be wrongly classified as NSFW when they are not. Additionally, please note that using the quantized ONNX model within the transformers.js pipeline will slightly reduce the model's accuracy.
You can find a toy implementation of this model with Transformers.js [here](https://github.com/AdamCodd/media-random-generator).
### Training and evaluation data
More information needed
### Training procedure
#### Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):
- learning_rate: 3e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- num_epochs: 1
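The original training script is not published here; the sketch below only shows how the hyperparameters above would map onto Hugging Face `TrainingArguments`, assuming the `Trainer` API was used. The `output_dir` is a placeholder and any setting not listed above is left at its library default.

```python
from transformers import TrainingArguments

# Hedged sketch: only the values listed above come from the model card;
# output_dir is a placeholder and everything else keeps library defaults.
training_args = TrainingArguments(
    output_dir="vit-base-nsfw-detector-finetune",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```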
#### Training results
- Validation Loss: 0.0937
- Accuracy: 0.9654
- AUC: 0.9948
[Confusion matrix](https://huggingface.co/AdamCodd/vit-base-nsfw-detector/resolve/main/confusion_matrix.png) (eval):

```
[1076   37]
[  60 1627]
```
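As a sanity check, the confusion matrix can be turned back into metrics. The sketch below assumes rows are true labels and columns are predictions, both ordered [SFW, NSFW]; the resulting accuracy matches the reported 0.9654, while the per-class precision and recall are only derived values under that ordering assumption.

```python
import numpy as np

# Assumed layout: rows = true labels, columns = predictions, order [SFW, NSFW].
cm = np.array([[1076,   37],
               [  60, 1627]])

accuracy = cm.trace() / cm.sum()            # (1076 + 1627) / 2800 ≈ 0.9654
precision_nsfw = cm[1, 1] / cm[:, 1].sum()  # 1627 / (37 + 1627)
recall_nsfw = cm[1, 1] / cm[1, :].sum()     # 1627 / (60 + 1627)
print(f"accuracy={accuracy:.4f} precision={precision_nsfw:.4f} recall={recall_nsfw:.4f}")
```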
#### Framework versions
- Transformers 4.36.2
- Evaluate 0.4.1
If you want to support me, you can [here](https://ko-fi.com/adamcodd).
## Usage Examples
### Basic Usage
#### For a local image
```python
from transformers import pipeline
from PIL import Image

img = Image.open("<path_to_image_file>")
predict = pipeline("image-classification", model="AdamCodd/vit-base-nsfw-detector")
predict(img)
```
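The pipeline returns a list of `{'label': ..., 'score': ...}` dictionaries. The following sketch turns that output into a simple boolean flag; the lowercase `nsfw` label string and the 0.5 cut-off are assumptions made for illustration, not values from the model card.

```python
from transformers import pipeline
from PIL import Image

classifier = pipeline("image-classification", model="AdamCodd/vit-base-nsfw-detector")
results = classifier(Image.open("<path_to_image_file>"))

# Pick the highest-scoring class and flag the image; the label string and the
# 0.5 threshold are illustrative assumptions.
top = max(results, key=lambda r: r["score"])
print(top)
print("Flagged as NSFW:", top["label"].lower() == "nsfw" and top["score"] >= 0.5)
```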
#### For a distant image
```python
from transformers import ViTImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('AdamCodd/vit-base-nsfw-detector')
model = AutoModelForImageClassification.from_pretrained('AdamCodd/vit-base-nsfw-detector')
inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
#### With Transformers.js (Vanilla JS)
```js
import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.1';

env.allowLocalModels = false;
const classifier = await pipeline('image-classification', 'AdamCodd/vit-base-nsfw-detector');

async function classifyImage(url) {
    try {
        const response = await fetch(url);
        if (!response.ok) throw new Error('Failed to load image');

        const blob = await response.blob();
        const image = new Image();
        const imagePromise = new Promise((resolve, reject) => {
            image.onload = () => resolve(image);
            image.onerror = reject;
            image.src = URL.createObjectURL(blob);
        });

        const img = await imagePromise;
        const classificationResults = await classifier([img.src]);
        console.log('Predicted class: ', classificationResults[0].label);
    } catch (error) {
        console.error('Error classifying image:', error);
    }
}

classifyImage('https://example.com/path/to/image.jpg');
```
## License
This model is licensed under the [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
## Information Table
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned Vision Transformer (ViT) |
| Base Model | [google/vit-base-patch16-384](https://huggingface.co/google/vit-base-patch16-384) |
| Metrics | Accuracy, AUC, Loss |
| License | apache-2.0 |
| Tags | transformers.js, transformers, nlp |