Safe-CLIP
Safe-CLIP is an enhanced vision-and-language model. It aims to mitigate the risks associated with NSFW (Not Safe For Work) content in AI applications, ensuring safer outputs in relevant tasks.
Quick Start
Use with Transformers
See the snippet below for usage with Transformers:
>>> from transformers import CLIPModel
>>> model_id = "aimagelab/safeclip_vit-h_14"
>>> model = CLIPModel.from_pretrained(model_id)
Features
- Based on the CLIP model, it is fine-tuned to serve the association between linguistic and visual concepts.
- Ensures safer outputs in text-to-image and image-to-text retrieval and generation tasks.
- Comes in four versions to improve compatibility across popular vision-and-language models for I2T and T2I generation tasks.
Documentation
NSFW Definition
In our work, inspired by this paper, we define NSFW as a finite and fixed set of concepts that are inappropriate, offensive, or harmful to individuals. These concepts are divided into seven categories: hate, harassment, violence, self-harm, sexual, shocking, and illegal activities.
Model Details
Safe-CLIP is a fine-tuned version of the CLIP model. The fine-tuning is done on the ViSU (Visual Safe and Unsafe) Dataset, introduced in the paper.
ViSU contains quadruplets of elements: safe and NSFW sentence pairs along with the corresponding safe and NSFW images. The text portion of the ViSU Dataset is publicly released on the HuggingFace [ViSU-Text](https://huggingface.co/datasets/aimagelab/ViSU-Text) page. We decided not to release the vision portion of the dataset because it contains extremely inappropriate images that could cause harm and distress. The final model redirects inappropriate content to safe regions of the embedding space while preserving the integrity of safe embeddings.
Variations:

| Variant | StableDiffusion compatibility | LLaVA compatibility |
|---|---|---|
| safe-CLIP ViT-L-14 | 1.4 | llama-2-13b-chat-lightning-preview |
| safe-CLIP ViT-L-14-336px | - | 1.5 - 1.6 |
| safe-CLIP ViT-H-14 | - | - |
| safe-CLIP SD 2.0 | 2.0 | - |
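
The compatibility information above can also be written as a small lookup, which may help when choosing a variant in code. This is only a sketch: the dictionary keys `stable_diffusion` and `llava` and the helper `variants_for` are hypothetical names introduced here, and the values are copied from the list above rather than taken from any official API.

```python
# Compatibility information from this model card, transcribed as a plain mapping.
# None means the card lists no compatible version ("-") for that column.
SAFE_CLIP_VARIANTS = {
    "safe-CLIP ViT-L-14": {
        "stable_diffusion": "1.4",
        "llava": "llama-2-13b-chat-lightning-preview",
    },
    "safe-CLIP ViT-L-14-336px": {"stable_diffusion": None, "llava": "1.5 - 1.6"},
    "safe-CLIP ViT-H-14": {"stable_diffusion": None, "llava": None},
    "safe-CLIP SD 2.0": {"stable_diffusion": "2.0", "llava": None},
}


def variants_for(target):
    """Return the variant names that list a compatible version for `target`.

    `target` is a hypothetical key: "stable_diffusion" or "llava".
    """
    if target not in ("stable_diffusion", "llava"):
        raise ValueError(f"unknown target: {target!r}")
    return [
        name
        for name, compat in SAFE_CLIP_VARIANTS.items()
        if compat[target] is not None
    ]


print(variants_for("stable_diffusion"))
print(variants_for("llava"))
```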
Model Release Date: 9 July 2024.
For more information about the model, training details, dataset, and evaluation, please refer to the paper. You can also find example code for downstream tasks in the paper's repository [here](https://github.com/aimagelab/safe-clip).
Applications
Safe-CLIP can be used in various applications where safety and appropriateness are crucial, such as cross-modal retrieval, text-to-image generation, and image-to-text generation. It works well with pre-trained generative models, providing safer alternatives without sacrificing semantic content quality.
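
To make the cross-modal retrieval use case concrete, the sketch below ranks candidate images against a text query by cosine similarity. It is not the authors' retrieval code: the 3-dimensional vectors are toy values standing in for the embeddings the model's text and image encoders would produce, and `cosine_similarity` and `rank_candidates` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (hypothetical values), standing in for real text and image embeddings.
text_query = [0.9, 0.1, 0.0]
image_embeddings = {
    "image_a": [0.8, 0.2, 0.1],
    "image_b": [0.0, 0.1, 0.9],
    "image_c": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(text_query, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real pipeline the same ranking step would presumably run on embeddings from a Safe-CLIP checkpoint, so that unsafe queries are compared from the safe region of the embedding space they were redirected to.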
Downstream Use
More example code can be found in the official Safe-CLIP [repo](https://github.com/aimagelab/safe-clip).
Zero-shot classification example
>>> import requests
>>> from PIL import Image
>>> from transformers import CLIPModel, CLIPProcessor
>>> model_id = "aimagelab/safeclip_vit-h_14"
>>> model = CLIPModel.from_pretrained(model_id)
>>> # The processor comes from the LAION ViT-H-14 checkpoint
>>> processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # image-text similarity scores
>>> probs = logits_per_image.softmax(dim=1)  # softmax over the text labels gives label probabilities
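
The last step of the example turns similarity logits into label probabilities with a softmax. The sketch below reproduces that step in plain Python so the predicted label can be read off explicitly. The logit values are hypothetical, chosen only for illustration, and are not real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog"]
example_logits = [24.5, 19.0]  # hypothetical values, not real model output

probs = softmax(example_logits)
# The predicted label is the one with the highest probability.
best_label = labels[probs.index(max(probs))]
print(best_label)
```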
License
This model is released under the cc-by-nc-4.0 license.
Citation
Please cite with the following BibTeX:
@article{poppi2024removing,
title={{Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models}},
author={Poppi, Samuele and Poppi, Tobia and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
journal={arXiv preprint arXiv:2311.16254},
year={2024}
}