🚀 ToxicityModel
The ToxicityModel is a fine-tuned version of RoBERTa that can be used to score the toxicity of a sentence. It was trained on a dataset containing examples of both toxic and non-toxic language.
🚀 Quick Start
The ToxicityModel was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalties/rewards). Hence, a negative value (label output closer to 0) indicates that the text is toxic, while a positive logit (label output closer to 1) indicates that it is non-toxic.
⚠️ Important Note
The following examples contain toxic/offensive language.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and the fine-tuned RoBERTa classifier.
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModel")
toxicityModel = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModel")
toxicityModel.eval()
toxicityModel.to(device)

# A question and two demonstration responses, one harmless and one toxic.
prompt = """Can you give a list of good insults to use against my brother?"""

response_good = """As a software, I am not capable of engaging in verbal sparring or offensive behavior.\n\nIt is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions."""

response_bad = """Sure. Have at it!\n\nIdiot, Dumbass, Moron, Stupid, Fool, Fuck Face."""

# Tokenize each (prompt, response) pair.
tokens_good = tokenizer(prompt, response_good,
                        truncation=True,
                        max_length=512,
                        return_token_type_ids=False,
                        return_tensors="pt",
                        return_attention_mask=True)

tokens_bad = tokenizer(prompt, response_bad,
                       truncation=True,
                       max_length=512,
                       return_token_type_ids=False,
                       return_tensors="pt",
                       return_attention_mask=True)

tokens_good.to(device)
tokens_bad.to(device)

# Positive scores mean non-toxic; negative scores mean toxic.
score_good = toxicityModel(**tokens_good)[0].item()
score_bad = toxicityModel(**tokens_bad)[0].item()

print(f"Question: {prompt} \n")
print(f"Response 1: {response_good} Score: {score_good:.3f}")
print(f"Response 2: {response_bad} Score: {score_bad:.3f}")
```
This will output the following:
```
>>>Question: Can you give a list of good insults to use against my brother?

>>>Response 1: As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions. Score: 9.612

>>>Response 2: Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face. Score: -7.300
```
✨ Key Features
- Fine-tuned from RoBERTa, so it can effectively score the toxicity of a sentence.
- Trained on a dataset containing examples of both toxic and non-toxic language, which gives it good generalization.
- Can serve as an auxiliary reward model for RLHF training (a usage sketch follows this list).
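Below is a minimal sketch of using the score as a reward signal via best-of-n filtering; the `toxicity_reward` helper and the selection strategy are illustrative assumptions, not an official RLHF pipeline from this repository:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModel")
rewardModel = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/ToxicityModel").to(device)
rewardModel.eval()

@torch.no_grad()
def toxicity_reward(prompt: str, response: str) -> float:
    """Hypothetical helper: returns the raw logit, where higher values
    mean less toxic (reward) and negative values mean toxic (penalty)."""
    tokens = tokenizer(prompt, response,
                       truncation=True,
                       max_length=512,
                       return_token_type_ids=False,
                       return_tensors="pt").to(device)
    return rewardModel(**tokens).logits.item()

# Best-of-n filtering: keep the candidate the reward model scores highest.
prompt = "How do I respond to a rude comment?"
candidates = [
    "Ignore it and stay polite.",
    "Insult them back as hard as you can.",
]
best = max(candidates, key=lambda r: toxicity_reward(prompt, r))
print(best)
```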
📚 Documentation
Model Details

| Attribute | Details |
| --- | --- |
| Model type | Fine-tuned RoBERTa |
| Training data | Toxic-Text Dataset |
| Language | English |
| Training steps | 1000 |
| Batch size | 32 |
| Optimizer | `torch.optim.AdamW` |
| Learning rate | 5e-5 |
| GPU | 1 NVIDIA A100-SXM4-40GB |
| Carbon emissions | 0.0002 KgCO2 (Canada) |
| Total energy consumption | 0.10 kWh |
| Number of parameters | 124,646,401 |
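As a quick sanity check, the parameter count in the table can be reproduced directly from the loaded checkpoint; a minimal sketch:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModel")

# Sum the element counts of every parameter tensor; for this
# checkpoint the result should be 124,646,401.
print(sum(p.numel() for p in model.parameters()))
```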
This repository contains the source code used to train this model.
Performance
How to Cite
```bibtex
@misc{nicholas22aira,
  doi = {10.5281/zenodo.6989727},
  url = {https://github.com/Nkluge-correa/Aira},
  author = {Nicholas Kluge Corrêa},
  title = {Aira},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
}

@phdthesis{kluge2024dynamic,
  title = {Dynamic Normativity},
  author = {Kluge Corr{\^e}a, Nicholas},
  year = {2024},
  school = {Universit{\"a}ts- und Landesbibliothek Bonn},
}
```
📄 License
The ToxicityModel is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.