Safe-CLIP Open-Source Vision-Language Model - Reducing the Risk of Inappropriate Content in AI Applications

Safeclip Vit H 14

Developed by aimagelab

Safe-CLIP is an enhanced vision-language model designed to mitigate risks associated with Not Safe For Work (NSFW) content in AI applications.

Text-to-Image

Transformers

#NSFW filtering #Cross-modal safety #Zero-shot classification

Downloads 30

Release Time: 9 July 2024

Model Overview

Based on the CLIP model, Safe-CLIP is fine-tuned to sever associations between language and visual concepts tied to NSFW content, ensuring safer outputs in text-to-image and image-to-text retrieval and generation tasks.

Model Features

NSFW content filtering

Fine-tuned to remove Not Safe For Work (NSFW) concepts, ensuring safer outputs.

Multi-version compatibility

Offers four versions compatible with popular vision-language models like StableDiffusion and LLaVA.

Safe embedding space

Redirects inappropriate content to safe regions in the embedding space while preserving the integrity of safe embeddings.

Model Capabilities

Text-to-image retrieval

Image-to-text retrieval

Cross-modal retrieval

NSFW content filtering

Use Cases

Content safety

Safe image retrieval

Filters Not Safe For Work content in image retrieval tasks.

Outputs safer retrieval results.

Safe image generation

Avoids generating Not Safe For Work content in text-to-image generation tasks.

Generates safer images.

Cross-modal applications

Cross-modal retrieval

Performs safe cross-modal retrieval between text and images.

Provides safe cross-modal associations.

🚀 Safe-CLIP

Safe-CLIP is an enhanced vision-and-language model. It aims to mitigate the risks associated with NSFW (Not Safe For Work) content in AI applications, ensuring safer outputs in relevant tasks.

🚀 Quick Start

Use with Transformers

See the snippet below for usage with Transformers:

>>> from transformers import CLIPModel

>>> model_id = "aimagelab/safeclip_vit-h_14"
>>> model = CLIPModel.from_pretrained(model_id)

✨ Features

Based on the CLIP model, it is fine-tuned to sever the association between linguistic and visual concepts tied to NSFW content.
Ensures safer outputs in text-to-image and image-to-text retrieval and generation tasks.
Comes in four versions to improve compatibility across popular vision-and-language models for image-to-text (I2T) and text-to-image (T2I) generation tasks.

📚 Documentation

NSFW Definition

In our work, inspired by this paper, we define NSFW as a finite and fixed set of concepts that are inappropriate, offensive, or harmful to individuals. These concepts are divided into seven categories: hate, harassment, violence, self-harm, sexual, shocking, and illegal activities.

Model Details

Safe-CLIP is a fine-tuned version of the CLIP model. Fine-tuning uses the ViSU (Visual Safe and Unsafe) Dataset, introduced in the paper.

ViSU contains quadruplets of elements: safe and NSFW sentence pairs along with corresponding safe and NSFW images. The text portion of the ViSU Dataset is publicly released on the HuggingFace [ViSU-Text](https://huggingface.co/datasets/aimagelab/ViSU-Text) page. We decided not to release the vision portion of the dataset because it contains extremely inappropriate images that could cause harm and distress. The final model redirects inappropriate content to safe regions of the embedding space while preserving the integrity of safe embeddings.

Variations:

Model variant	StableDiffusion compatibility	LLaVA compatibility
safe-CLIP ViT-L-14	1.4	llama-2-13b-chat-lightning-preview
safe-CLIP ViT-L-14-336px	-	1.5 - 1.6
safe-CLIP ViT-H-14	-	-
safe-CLIP SD 2.0	2.0	-
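
The compatibility table above can also be kept as a small lookup structure when choosing a variant programmatically. The sketch below is only an illustration: the dictionary layout and the helper name `variants_for` are assumptions made here, not part of the Safe-CLIP release, and the values are copied from the table as-is.

```python
# Compatibility table from the model card, keyed by variant name.
# None means the card lists no compatible version ("-") for that framework.
COMPATIBILITY = {
    "safe-CLIP ViT-L-14": {
        "StableDiffusion": "1.4",
        "LLaVA": "llama-2-13b-chat-lightning-preview",
    },
    "safe-CLIP ViT-L-14-336px": {"StableDiffusion": None, "LLaVA": "1.5 - 1.6"},
    "safe-CLIP ViT-H-14": {"StableDiffusion": None, "LLaVA": None},
    "safe-CLIP SD 2.0": {"StableDiffusion": "2.0", "LLaVA": None},
}


def variants_for(framework):
    """Return the variants that list a compatible version for `framework`.

    `variants_for` is a hypothetical helper written for this example.
    """
    return [
        name
        for name, compat in COMPATIBILITY.items()
        if compat.get(framework) is not None
    ]


for framework in ("StableDiffusion", "LLaVA"):
    print(framework, "->", variants_for(framework))
```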

Model Release Date: 9 July 2024.

For more information about the model, training details, dataset, and evaluation, please refer to the paper. Example code for downstream tasks is available in the paper's repository [here](https://github.com/aimagelab/safe-clip).

Applications

Safe-CLIP can be used in various applications where safety and appropriateness are crucial, such as cross-modal retrieval, text-to-image generation, and image-to-text generation. It works well with pre-trained generative models, providing safer alternatives without sacrificing semantic content quality.
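
As a rough illustration of the cross-modal retrieval use case, the sketch below ranks a gallery of image embeddings by cosine similarity to a text embedding. It is a minimal example under stated assumptions, not the authors' retrieval pipeline: the toy 3-dimensional vectors and the helper names `cosine_similarity` and `rank_images` are made up here. In practice the embeddings would come from the model, for example through `CLIPModel.get_text_features` and `CLIPModel.get_image_features` in Transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (image_name, score) pairs, most similar first."""
    scores = {
        name: cosine_similarity(text_embedding, embedding)
        for name, embedding in image_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.0, 1.0],
}

ranking = rank_images(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same ranking function works in the other direction (one image embedding against a gallery of caption embeddings), since cosine similarity is symmetric.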

Downstream Use

More example code can be found in the official Safe-CLIP [repo](https://github.com/aimagelab/safe-clip).

Zero-shot classification example

>>> import requests
>>> from PIL import Image
>>> from transformers import CLIPModel, CLIPProcessor

>>> model_id = "aimagelab/safeclip_vit-h_14"

>>> # Load the Safe-CLIP weights; the processor comes from the LAION ViT-H-14 checkpoint
>>> model = CLIPModel.from_pretrained(model_id)
>>> processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

>>> # Download a sample COCO image and preprocess it together with the candidate captions
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
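
To make the last step concrete, the sketch below reproduces the softmax over image-text similarity scores in plain Python and picks the most likely caption. The logit values are hypothetical placeholders, not real model output, and the `softmax` helper is written here only for illustration; the snippet above relies on PyTorch's built-in `softmax`.

```python
import math


def softmax(logits):
    """Convert a list of scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog"]
logits = [24.0, 18.0]  # hypothetical similarity scores, not real model output

probs = softmax(logits)
best_label = labels[probs.index(max(probs))]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")
print("predicted:", best_label)
```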

📄 License

This model is released under the cc-by-nc-4.0 license.

📚 Citation

Please cite with the following BibTeX:

@article{poppi2024removing,
  title={{Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models}},
  author={Poppi, Samuele and Poppi, Tobia and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2311.16254},
  year={2024}
}
