Safeclip_sd_20: Open-Source Vision-and-Language Model for Lower-Risk Text-to-Image and Image-to-Text Tasks

Safeclip Sd 20

Developed by aimagelab

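Safe-CLIP is an enhanced vision-and-language model based on CLIP. It is fine-tuned to reduce the risk of NSFW content and is suited to text-to-image and image-to-text tasks.

Text-to-Image

Transformers

#NSFW content filtering #Safe cross-modal generation #Safe text-image retrieval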

Downloads 27

Release Time : 7/10/2024

Model Overview

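Safe-CLIP adjusts the association between linguistic and visual concepts so that cross-modal retrieval and generation tasks produce safer output. It is intended for applications where safety and appropriateness are critical.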

Model Features

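NSFW content filtering

Fine-tuning reduces the risks associated with NSFW (not safe for work) content, making outputs safer.

Multi-version compatibility

Four versions are available, compatible with popular vision-and-language models such as StableDiffusion and LLaVA.

Safe embedding space

Inappropriate content is redirected to safe regions of the embedding space, while the integrity of safe embeddings is preserved.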

Model Capabilities

Text-to-image retrieval

Image-to-text retrieval

Safe content generation

Zero-shot classification

Use Cases

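Content generation

Safe text-to-image generation

Combine Safe-CLIP with StableDiffusion to generate safe image content.

The generated images avoid NSFW content and are suitable for workplace use.

Content moderation

NSFW content detection

Detect and filter inappropriate image and text content.

Reduces the likelihood that NSFW content appears.

Cross-modal retrieval

Safe image-to-text retrieval

In image-to-text retrieval tasks, ensure that the returned text is safe.

The retrieval results avoid inappropriate text.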
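
The retrieval use case above comes down to ranking candidate captions by their similarity to an image embedding. The following is a minimal sketch of that ranking step in plain Python, assuming the embeddings have already been produced by the model's image and text encoders. The function names and the three-dimensional toy vectors are hypothetical and are not taken from the Safe-CLIP repository.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(
    image_emb: Sequence[float],
    caption_embs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (caption, similarity) pairs sorted from most to least similar."""
    scored = [(text, cosine(image_emb, emb)) for text, emb in caption_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made up for illustration; real CLIP embeddings are much larger).
image_emb = [0.9, 0.1, 0.0]
caption_embs = {
    "two cats on a couch": [1.0, 0.0, 0.0],
    "a dog in a park": [0.0, 1.0, 0.0],
}
ranking = rank_captions(image_emb, caption_embs)
```

With a safety-tuned encoder, the same ranking code would run unchanged; only the embeddings fed into it differ.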

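🚀 Safe-CLIP: A Safe Vision-and-Language Model

Safe-CLIP is an enhanced vision-and-language model designed to reduce the risks associated with NSFW (not safe for work) content in AI applications. It is fine-tuned from CLIP so that retrieval and generation tasks, both text-to-image and image-to-text, produce safer results.

🚀 Quick Start

To load Safe-CLIP with the Transformers library, use the following code: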

>>> from transformers import CLIPModel

>>> model_id = "aimagelab/safeclip_sd_20"
>>> model = CLIPModel.from_pretrained(model_id)

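✨ Key Features

NSFW definition

In this work, inspired by this paper, we define NSFW as a finite and fixed set of concepts that are considered inappropriate, offensive, or harmful to individuals. These concepts fall into seven categories: hate, harassment, violence, self-harm, sexual, shocking, and illegal activities.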
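
Because the definition is a fixed, closed set, it maps naturally onto an enumeration. The sketch below is only an illustration of that idea: the enum name, member names, and the `parse_category` helper are hypothetical and do not appear in the Safe-CLIP code.

```python
from enum import Enum


class NSFWCategory(Enum):
    """The seven categories listed above (names chosen for this sketch)."""

    HATE = "hate"
    HARASSMENT = "harassment"
    VIOLENCE = "violence"
    SELF_HARM = "self-harm"
    SEXUAL = "sexual"
    SHOCKING = "shocking"
    ILLEGAL_ACTIVITIES = "illegal activities"


def parse_category(label: str) -> NSFWCategory:
    """Map a label such as ' Violence ' to a category; raises ValueError if unknown."""
    return NSFWCategory(label.strip().lower())
```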

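Model details

Safe-CLIP is a fine-tuned version of the CLIP model. Fine-tuning uses the ViSU (Visual Safe and Unsafe) dataset, which is introduced in the same paper.

ViSU consists of quadruplets: a safe sentence and an NSFW sentence, together with the corresponding safe and NSFW images. The text portion of ViSU is available on HuggingFace on the [ViSU-Text](https://huggingface.co/datasets/aimagelab/ViSU-Text) page. We decided not to release the visual portion of the dataset because it contains extremely inappropriate images that could cause harm and distress to individuals. Releasing it would be irresponsible and contrary to the principle of ensuring the safe and ethical use of AI technology. The final model redirects inappropriate content to safe regions of the embedding space while preserving the integrity of safe embeddings.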
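
One way to picture a ViSU quadruplet is as a small record type. The sketch below is an assumption about how such a record could be represented, not the dataset's actual schema: the class name and field names are hypothetical, and the image fields default to `None` to reflect that only the text portion is released.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ViSUQuadruplet:
    """One safe/NSFW sentence pair with optional paths to the paired images."""

    safe_text: str
    nsfw_text: str
    safe_image_path: Optional[str] = None  # visual portion is not public
    nsfw_image_path: Optional[str] = None  # visual portion is not public

    def has_images(self) -> bool:
        """True only when both image paths are present."""
        return self.safe_image_path is not None and self.nsfw_image_path is not None

    def text_pair(self) -> Tuple[str, str]:
        """Return the (safe, NSFW) sentence pair."""
        return (self.safe_text, self.nsfw_text)


# Placeholder strings stand in for real dataset sentences.
example = ViSUQuadruplet(safe_text="<safe sentence>", nsfw_text="<nsfw sentence>")
```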

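Model variants: Safe-CLIP comes in four versions to improve compatibility with some of the most popular vision-and-language models used for image-to-text (I2T) and text-to-image (T2I) generation tasks. The table below gives the details:

Model version	StableDiffusion compatibility	LLaVA compatibility
safe-CLIP ViT-L-14	1.4	llama-2-13b-chat-lightning-preview
safe-CLIP ViT-L-14-336px	-	1.5 - 1.6
safe-CLIP ViT-H-14	-	-
safe-CLIP SD 2.0	2.0	-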
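
When choosing a variant programmatically, the table can be encoded as a small lookup. This is a sketch under assumptions: the keys are the version names from the table rather than repository IDs, `None` stands for the "-" cells, and the dictionary and function names are hypothetical.

```python
from typing import Dict, List, Optional

# Rows of the compatibility table above; None marks a "-" cell.
COMPATIBILITY: Dict[str, Dict[str, Optional[str]]] = {
    "safe-CLIP ViT-L-14": {
        "stable_diffusion": "1.4",
        "llava": "llama-2-13b-chat-lightning-preview",
    },
    "safe-CLIP ViT-L-14-336px": {"stable_diffusion": None, "llava": "1.5 - 1.6"},
    "safe-CLIP ViT-H-14": {"stable_diffusion": None, "llava": None},
    "safe-CLIP SD 2.0": {"stable_diffusion": "2.0", "llava": None},
}


def variants_for(target: str, version: str) -> List[str]:
    """List the Safe-CLIP versions whose `target` column equals `version`."""
    return [
        name for name, row in COMPATIBILITY.items() if row.get(target) == version
    ]
```

For example, `variants_for("stable_diffusion", "2.0")` selects the variant this page describes.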

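Model release date: July 9, 2024.

For more information about the model, training details, dataset, and evaluation, see the paper. Example code for downstream tasks is available in the paper's [repository](https://github.com/aimagelab/safe-clip).

Applications

Safe-CLIP can be used in a range of applications where safety and appropriateness are critical, including cross-modal retrieval, text-to-image generation, and image-to-text generation. It works with pre-trained generative models and provides a safer alternative without compromising the quality of the semantic content.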

💻 Usage Examples

Safe text-to-image generation

>>> from diffusers import StableDiffusionPipeline
>>> from transformers import CLIPTextModel
>>> from torch import Generator

>>> # set device to GPU
>>> device = 'cuda'

>>> # set generator with seed for reproducibility
>>> generator = Generator(device=device)
>>> generator.manual_seed(42)

>>> clip_backbone = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
>>> sd_model_id = "stabilityai/stable-diffusion-2"

>>> safeclip_text_model = CLIPTextModel.from_pretrained("aimagelab/safeclip_sd_20")

>>> # load the StableDiffusion 2.0 pipeline
>>> safe_pipeline = StableDiffusionPipeline.from_pretrained(sd_model_id, safety_checker=None)

>>> # set the text_encoder of StableDiffusion to the safeCLIP text encoder to make it safe
>>> safe_pipeline.text_encoder = safeclip_text_model
>>> safe_pipeline = safe_pipeline.to(device)

>>> # Disclaimer! Note that the purpose of this snippet is to demonstrate the generation of a safe image using the safe-clip model.
>>> # The prompt used in this snippet is inappropriate and is only used for demonstration purposes (the resulting image is safe).
>>> prompt = "A young woman being raped on the beach from behind"
>>> safe_image = safe_pipeline(prompt=prompt, generator=generator).images[0]
>>> safe_image.save("safe_image.png")

Zero-shot classification example

>>> import requests
>>> from PIL import Image
>>> from transformers import CLIPModel, CLIPProcessor

>>> model_id = "aimagelab/safeclip_sd_20"

>>> model = CLIPModel.from_pretrained(model_id)
>>> # the processor is loaded from the laion CLIP ViT-H-14 checkpoint
>>> processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
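
The last step of the example turns image-text similarity scores into label probabilities with a softmax. The sketch below reproduces that step in plain Python so it can be followed without downloading the model. The logit values are made up for illustration, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["a photo of a cat", "a photo of a dog"]
toy_logits = [24.5, 19.0]  # made-up similarity scores, not model output
label, prob = predict_label(toy_logits, labels)
```

In the real pipeline, one row of `logits_per_image` would take the place of `toy_logits`.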

📄 License

This project is released under the CC-BY-NC-4.0 license.

📚 Citation

Please cite with the following BibTeX:

@article{poppi2024removing,
  title={{Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models}},
  author={Poppi, Samuele and Poppi, Tobia and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2311.16254},
  year={2024}
}