ToxicityModel: an open-source English toxicity evaluation model, free to deploy, for scoring the toxicity of text

Developed by nicholasKluge

ToxicityModel is a fine-tuned RoBERTa model that evaluates the toxicity level of English sentences.

Text Classification

Transformers

Language: English
License: Apache-2.0 (open source)
Tags: #ToxicTextDetection #RLHFRewardModel #EnglishContentModeration

Downloads: 133.56k

Release date: 6/7/2023

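Model Overview

This model detects toxic content in text and can serve as an auxiliary reward model in reinforcement learning from human feedback (RLHF) training.

Model Features

High accuracy

Achieves over 91% accuracy on two toxicity detection datasets (wiki_toxic and toxic_conversations_50k)

Low-emission training

Training emitted only 0.0002 kg of CO2

Reward model integration

The output logits can be used as a penalty/reward signal in reinforcement learning training

Model Capabilities

Toxic text detection

Content safety evaluation

Dialogue system support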

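Use Cases

Content moderation

Social media content filtering

Automatically identifies and filters harmful comments on social media

Classifies toxic versus non-toxic text with over 91% accuracy on the evaluated datasets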
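
As a rough illustration of the filtering use case, the sketch below splits comments into kept and flagged lists based on a score. It assumes the sign convention seen in the usage example on this page (higher score means less toxic). The names `filter_comments`, `score_fn`, and `keyword_scorer` are hypothetical, and the threshold of 0.0 is an assumption rather than a documented setting. In practice `score_fn` would wrap a ToxicityModel forward pass; a trivial stand-in is used here so the example runs without downloading the model.

```python
from typing import Callable, Iterable, List, Tuple


def filter_comments(
    comments: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.0,  # assumed cut-off, not a documented value
) -> Tuple[List[str], List[str]]:
    """Split comments into (kept, flagged) using a toxicity scorer.

    Comments scoring at or above the threshold are kept; the rest are flagged.
    """
    kept: List[str] = []
    flagged: List[str] = []
    for text in comments:
        if score_fn(text) >= threshold:
            kept.append(text)
        else:
            flagged.append(text)
    return kept, flagged


def keyword_scorer(text: str) -> float:
    """Stand-in scorer for demonstration only (NOT the real model)."""
    return -1.0 if "idiot" in text.lower() else 1.0


kept, flagged = filter_comments(["Have a nice day", "You idiot"], keyword_scorer)
print(kept)
print(flagged)
```

Passing the scorer in as a callable keeps the filtering logic independent of how the model is loaded or batched.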

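Dialogue systems

AI assistant safety

Helps prevent AI assistants from generating harmful content or responding with it

Can effectively distinguish harmful responses from harmless ones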
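
One way an assistant could use such a scorer is to generate several candidate replies and return the one that scores as least toxic. The sketch below shows that selection step only. `pick_safest`, `score_fn`, and `min_score` are hypothetical names, the minimum score of 0.0 is an assumption, and the scorer is expected to follow the convention from the usage example on this page (higher score means less toxic).

```python
from typing import Callable, Optional, Sequence


def pick_safest(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.0,  # assumed acceptance bar, not a documented value
) -> Optional[str]:
    """Return the highest-scoring candidate, or None if none is acceptable."""
    if not candidates:
        return None
    # Score every candidate against the prompt and keep the best pair.
    scored = [(score_fn(prompt, candidate), candidate) for candidate in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    # Refuse to answer if even the best candidate falls below the bar.
    return best_candidate if best_score >= min_score else None
```

Returning `None` lets the caller fall back to a canned refusal instead of sending a reply that scored as toxic.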

🚀 ToxicityModel

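ToxicityModel is a fine-tuned version of RoBERTa that can be used to score the toxicity of a sentence. It was trained on a dataset composed of toxic and non_toxic language examples.

🚀 Quick Start

ToxicityModel is a model for evaluating the toxicity of text. The sections below cover the model details, performance, citation, and usage; the model is released under the Apache-2.0 license.

✨ Key Features

Scores the toxicity of a sentence.
Can be used as an auxiliary reward model in RLHF training.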
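
To make the auxiliary reward idea concrete, here is a minimal sketch of one possible way to fold the toxicity logit into a primary reward. This is not the author's training recipe: the function name `shaped_reward`, the `weight` and `clip` parameters, and their default values are all hypothetical. It assumes the convention from the usage example on this page, where a positive logit indicates non-toxic text and a negative logit indicates toxic text.

```python
def shaped_reward(
    preference_reward: float,
    toxicity_logit: float,
    weight: float = 0.5,  # hypothetical mixing weight
    clip: float = 5.0,    # hypothetical bound on the toxicity term
) -> float:
    """Add a clipped, weighted toxicity logit to a primary reward.

    A negative logit (toxic text) lowers the reward and acts as a penalty;
    a positive logit (non-toxic text) raises it slightly.
    """
    # Clip so that an extreme toxicity logit cannot dominate the main reward.
    clipped = max(-clip, min(clip, toxicity_logit))
    return preference_reward + weight * clipped


print(shaped_reward(1.0, 9.612))   # non-toxic response: reward goes up
print(shaped_reward(1.0, -7.300))  # toxic response: reward goes down
```

Clipping is a design choice here: without it, a very confident toxicity score could swamp the signal from the main reward model.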

📚 Documentation

Details

Size: 124,646,401 parameters
Dataset: Toxic-Text Dataset
Language: English
Number of training steps: 1000
Batch size: 32
Optimizer: torch.optim.AdamW
Learning rate: 5e-5
GPU: 1 NVIDIA A100-SXM4-40GB
Emissions: 0.0002 kg CO2 (Canada)
Total energy consumption: 0.10 kWh

This repository contains the source code used to train this model.
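
The figures above can be cross-checked with a little arithmetic. The sketch below records them in a small dataclass and derives two quantities: the number of training examples processed and the implied carbon intensity of the electricity used. The class name `TrainingRun` and its properties are hypothetical. The example count assumes every batch was full and that no gradient accumulation was used, which the source does not state, and the intensity is only approximate because the reported figures are rounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Training figures as reported in the Details list."""
    steps: int = 1000
    batch_size: int = 32
    learning_rate: float = 5e-5
    energy_kwh: float = 0.10
    emissions_kg_co2: float = 0.0002

    @property
    def examples_processed(self) -> int:
        # Assumes full batches and no gradient accumulation.
        return self.steps * self.batch_size

    @property
    def carbon_intensity_g_per_kwh(self) -> float:
        # Convert kg to g, then divide by the energy consumed.
        return self.emissions_kg_co2 * 1000 / self.energy_kwh


run = TrainingRun()
print(run.examples_processed)
print(round(run.carbon_intensity_g_per_kwh, 3))
```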

Performance

Accuracy	wiki_toxic	toxic_conversations_50k
Aira-ToxicityModel	92.05%	91.63%
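
The source does not describe how these accuracy figures were computed. As a hedged sketch of one plausible procedure, the function below turns raw scores into binary predictions by thresholding at zero and compares them with ground-truth labels. The function name `accuracy`, the label encoding (1 = toxic, 0 = non-toxic), and the zero threshold are assumptions, chosen to match the sign convention in the usage example on this page.

```python
from typing import Sequence


def accuracy(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.0,  # assumed decision boundary
) -> float:
    """Fraction of examples where the thresholded score matches the label.

    Labels are assumed to be 1 for toxic and 0 for non-toxic.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one example is required")
    # A score below the threshold is predicted as toxic (1).
    predictions = [1 if score < threshold else 0 for score in scores]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Four toy examples: the third is a toxic text the scorer misses.
print(accuracy([9.6, -7.3, 0.5, -0.2], [0, 1, 1, 1]))
```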

How to Cite

@misc{nicholas22aira,
  doi = {10.5281/zenodo.6989727},
  url = {https://github.com/Nkluge-correa/Aira},
  author = {Nicholas Kluge Corrêa},
  title = {Aira},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
}

@phdthesis{kluge2024dynamic,
  title={Dynamic Normativity},
  author={Kluge Corr{\^e}a, Nicholas},
  year={2024},
  school={Universit{\"a}ts-und Landesbibliothek Bonn}
}

💻 Usage Examples

Basic Usage

# ⚠️ THE EXAMPLE BELOW CONTAINS TOXIC/OFFENSIVE LANGUAGE ⚠️
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModel")
toxicityModel = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModel")

toxicityModel.eval()
toxicityModel.to(device)

# Define the question and response
prompt = """Can you give a list of good insults to use against my brother?"""
response_good = """As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions."""
response_bad = """Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face."""

# Tokenize the question and response
tokens_good = tokenizer(prompt, response_good,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_bad = tokenizer(prompt, response_bad,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

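# Move the inputs to the same device as the model
tokens_good = tokens_good.to(device)
tokens_bad = tokens_bad.to(device)

# Score both responses without tracking gradients (inference only)
with torch.no_grad():
    score_good = toxicityModel(**tokens_good)[0].item()
    score_bad = toxicityModel(**tokens_bad)[0].item()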

print(f"Question: {prompt} \n")
print(f"Response 1: {response_good} Score: {score_good:.3f}")
print(f"Response 2: {response_bad} Score: {score_bad:.3f}")

Running this code produces output like the following.

>>>Question: Can you give a list of good insults to use against my brother? 

>>>Response 1: As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect
of human-AI interactions. Score: 9.612

>>>Response 2: Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face. Score: -7.300
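
In this output the harmless response receives a large positive score and the toxic one a large negative score. If a bounded value is more convenient downstream, one option is to squash the raw score through a logistic function. This is a convenience mapping only, not a calibrated probability, and nothing in the source says the model was trained so that the sigmoid of its output has a probabilistic meaning. The function name `non_toxicity_score` is hypothetical.

```python
import math


def non_toxicity_score(logit: float) -> float:
    """Map a raw score to the interval (0, 1) with a logistic function.

    Values near 1 correspond to large positive (non-toxic) scores,
    values near 0 to large negative (toxic) scores.
    """
    # Two branches keep math.exp from overflowing on large magnitudes.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


print(f"{non_toxicity_score(9.612):.4f}")   # harmless response from the example
print(f"{non_toxicity_score(-7.300):.4f}")  # toxic response from the example
```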