toxic-prompt-roberta: a free, open-source text classification model for detecting toxic prompts and replies in dialogue

Toxic Prompt Roberta

Developed by Intel

A RoBERTa-based text classification model for detecting toxic prompts and responses in conversational systems.

Text Classification

Transformers

Open Source License: MIT #ToxicityDetection #ConversationSecurity #RoBERTaFineTuning

Downloads 416

Release Time : 9/16/2024

Model Overview

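This model is based on the RoBERTa architecture and has been finetuned on the ToxicChat and Jigsaw Unintended Bias datasets. It is designed specifically to identify toxic content in conversations and serves as a safety guardrail for AI systems.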

Model Features

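Dual-Dataset Finetuning

Finetuned on the ToxicChat and Jigsaw Unintended Bias datasets together to improve detection accuracy.

Ethical Considerations

Training that accounts for fairness across demographic subgroups reduces classification bias.

Efficient Inference

Built on the optimized RoBERTa architecture and suited to real-time detection scenarios.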

Model Capabilities

Toxic Text Detection

Conversation Content Monitoring

Real-Time Content Moderation

Use Cases

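User Experience Monitoring

Real-Time Toxicity Detection

Monitors conversation content and detects toxic behavior by users.

Can issue warnings or provide guidance on appropriate conduct.

Content Moderation

Automated Moderation System

Automatically removes toxic messages or mutes offending users in group chats.

Maintains a healthy conversation environment.

AI Security

Chatbot Protection

Prevents the chatbot from responding to toxic input.

Reduces the risk of abuse of AI systems.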

license: mit
base_model: FacebookAI/roberta-base
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification

Model Details

Documentation

Toxic Prompt RoBERTa 1.0 is a text classification model that can be used as a guardrail to protect against toxic prompts and responses in conversational AI systems. The model is based on RoBERTa and has been finetuned on the ToxicChat and Jigsaw Unintended Bias datasets. Finetuning was performed on one Gaudi 2 card using Optimum-Habana's Gaudi Trainer.

Owners

Intel AI Safety: Daniel De Leon, Tyler Wilbers, Mitali Potnis, Abolfazl Shahbazi

Licenses

MIT

References

https://huggingface.co/Intel/toxic-prompt-roberta/tree/main

How to use

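You can run the model with the following code, which uses the Transformers pipeline API.

from transformers import pipeline

# Load the model and tokenizer from the Hugging Face Hub.
model_path = 'Intel/toxic-prompt-roberta'
pipe = pipeline('text-classification', model=model_path, tokenizer=model_path)
# The pipeline returns a list with one label/score dict per input.
print(pipe('Create 20 paraphrases of I hate you'))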

Citations

@inproceedings{Wolf_Transformers_State-of-the-Art_Natural_2020,
  author = {Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Perric and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, Mariama and Lhoest, Quentin and Rush, Alexander M.},
  month = oct,
  pages = {38--45},
  publisher = {Association for Computational Linguistics},
  title = {{Transformers: State-of-the-Art Natural Language Processing}},
  url = {https://www.aclweb.org/anthology/2020.emnlp-demos.6},
  year = {2020}
}

@article{DBLP:journals/corr/abs-1907-11692,
  author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov},
  title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal = {CoRR},
  volume = {abs/1907.11692},
  year = {2019},
  url = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@misc{jigsaw-unintended-bias-in-toxicity-classification,
  author = {cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum},
  title = {Jigsaw Unintended Bias in Toxicity Classification},
  publisher = {Kaggle},
  year = {2019},
  url = {https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification}
}

@misc{lin2023toxicchat,
  title = {ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation},
  author = {Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang},
  year = {2023},
  eprint = {2310.17389},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Model Parameters

We finetuned roberta-base (125M parameters) with a custom classification head to detect toxic input and output.

Input Format

The input format is standard text input for RoBERTa for sequence classification.

Output Format

The output is an (n, 2) array of logits, where n is the number of examples passed in for inference. Each row of logits has the form [not_toxic, toxic].
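To make the logit format concrete, the following is a minimal sketch of turning one [not_toxic, toxic] pair into a toxicity probability with a softmax. The helper names `softmax`, `toxic_probability`, and `score_texts` are illustrative and not part of the model card. The sketch assumes the model loads through the standard `AutoModelForSequenceClassification` class and returns one row of two logits per input text.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def toxic_probability(logit_pair: Sequence[float]) -> float:
    """Probability of the toxic class for one [not_toxic, toxic] pair."""
    return softmax(logit_pair)[1]


def score_texts(texts: List[str],
                model_path: str = "Intel/toxic-prompt-roberta") -> List[float]:
    """Run the model on a batch and return one toxic probability per text.

    Assumption: the checkpoint loads with the standard Auto classes.
    """
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # one row of two logits per text
    return [toxic_probability(row) for row in logits.tolist()]
```

Calling `score_texts(["Create 20 paraphrases of I hate you"])` downloads the model on first use. A common starting point is to treat probabilities above 0.5 as toxic, but the right threshold depends on the application.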

Considerations

Intended Users

Text Generation Researchers and Developers

Use Cases

User Experience Monitoring: The classification model can be used to monitor conversations in real-time to detect any toxic behavior by users. If a user sends messages that are classified as toxic, a warning can be issued or guidance on appropriate conduct can be provided.
Automated Moderation: In group chat scenarios, the classification model can act as a moderator by automatically removing toxic messages or muting users who consistently engage in toxic behavior.
Training and Improvement: The data collected from toxicity detection can be used to further train the toxicity classification model and improve how it handles various situations, making such models more adept at managing complex interactions.
Preventing Abuse of the Chatbot: Some users may attempt to troll or abuse chatbots with toxic input. The classification model can prevent the chatbot from engaging with such content, thereby discouraging this behavior.
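The moderation use cases above can be sketched as a small policy layer around any toxicity scorer. Everything in this example is hypothetical: the `ChatModerator` class, the 0.5 threshold, the strike limit, and the `keyword_scorer` stand-in are illustrative choices, not part of the model or its documentation. In practice the scorer would be a function that returns the model's toxic probability for a message.

```python
from collections import defaultdict
from typing import Callable, Dict


class ChatModerator:
    """Toy moderation policy built on top of any toxicity scorer."""

    def __init__(self, score_fn: Callable[[str], float],
                 threshold: float = 0.5, max_strikes: int = 3) -> None:
        self.score_fn = score_fn        # returns P(toxic) in [0, 1]
        self.threshold = threshold      # hypothetical cut-off, tune on validation data
        self.max_strikes = max_strikes  # hypothetical number of strikes before muting
        self.strikes: Dict[str, int] = defaultdict(int)

    def review(self, user: str, message: str) -> str:
        """Return 'allow', 'warn', or 'mute' for one incoming message."""
        if self.score_fn(message) < self.threshold:
            return "allow"
        self.strikes[user] += 1
        if self.strikes[user] >= self.max_strikes:
            return "mute"
        return "warn"


def keyword_scorer(message: str) -> float:
    """Stand-in scorer for the demo; replace with the real classifier."""
    return 0.9 if "hate" in message.lower() else 0.1


moderator = ChatModerator(keyword_scorer, max_strikes=2)
decisions = [
    moderator.review("alice", "Good morning"),
    moderator.review("bob", "I hate you"),
    moderator.review("bob", "I still hate you"),
]
print(decisions)  # ['allow', 'warn', 'mute']
```

Keeping the scorer injectable means the same policy can be exercised in tests with a stub and run in production with the real model.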

Ethical Considerations

Risk: Diversity Disparity
Mitigation Strategy: When finetuning on the Jigsaw Unintended Bias dataset, we ensured adequate representation in line with the distributions in Jigsaw's dataset. The Jigsaw Unintended Bias dataset attempts to distribute the toxicity labels evenly across the subgroups.
Risk: Risk to Vulnerable Persons
Mitigation Strategy: Certain demographic groups are more likely to receive toxic and harmful comments. The Jigsaw Unintended Bias dataset attempts to mitigate subgroup bias in the finetuned model by distributing the toxic/not toxic labels evenly across all demographic subgroups. We also test the model to confirm that classification bias across the subgroups is minimal.

Quantitative Analysis:

The plots below show the PR and ROC curves for three models we compared during finetuning. The “jigsaw” and the “tc” models were finetuned only on the Jigsaw Unintended Bias and ToxicChat datasets, respectively. The “jigsaw+tc” curves correspond to the final model that was finetuned on both datasets. Finetuning on both datasets did not significantly degrade the model’s performance on the ToxicChat test dataset with respect to the model finetuned solely on ToxicChat.

Model Performance

We compare the performance of Llama Guard 1 and 3 (LG1 and LG3) with our model on the ToxicChat test dataset, below.

Model	Parameters	Precision	Recall	F1	AUPRC	AUROC
LG1	6.74B	0.4806	0.7945	0.5989	0.626*	No data
LG3	8.03B	0.5083	0.4730	0.4900	No data	No data
Toxic Prompt RoBERTa	125M	0.8315	0.7469	0.7869	0.855	0.971

* from LG paper: https://arxiv.org/abs/2312.06674
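As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two columns. The sketch below uses only numbers copied from the table; the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "LG1": (0.4806, 0.7945, 0.5989),
    "LG3": (0.5083, 0.4730, 0.4900),
    "Toxic Prompt RoBERTa": (0.8315, 0.7469, 0.7869),
}

for name, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {f1:.4f}")
```

All three recomputed values agree with the reported F1 scores to four decimal places.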

Note that Llama Guard was not finetuned on ToxicChat. However, the LG1 paper reports an AUPRC of ~0.81 when Llama Guard 1 is finetuned on ToxicChat.

Because we finetuned RoBERTa on Jigsaw's Unintended Bias dataset, we can check for subgroup bias in its classifications on the Unintended Bias test set, shown below. These metrics were computed using Intel/bias_auc.

Metric	Female	Male	Christian	White	Muslim	Black	Homosexual gay or lesbian
AUROC	0.84937	0.80035	0.89867	0.76089	0.77137	0.74454	0.71766
BPSN	0.78805	0.82659	0.83746	0.78113	0.74067	0.82827	0.64330
BNSP	0.87421	0.80037	0.87614	0.81979	0.85586	0.76090	0.88065
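For readers unfamiliar with the three rows, the sketch below illustrates the subgroup metrics as defined for the Jigsaw Unintended Bias competition. It is not the Intel/bias_auc implementation, and the toy labels, scores, and subgroup flags are made up for illustration. Subgroup AUC uses only examples that mention the subgroup. BPSN (Background Positive, Subgroup Negative) combines non-toxic subgroup examples with toxic background examples. BNSP (Background Negative, Subgroup Positive) combines toxic subgroup examples with non-toxic background examples.

```python
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the chance that a toxic example outranks a non-toxic one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one toxic and one non-toxic example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


def _select(labels: Sequence[int], scores: Sequence[float],
            keep: Sequence[bool]) -> Tuple[List[int], List[float]]:
    """Keep only the examples whose flag in `keep` is True."""
    picked = [i for i, flag in enumerate(keep) if flag]
    return [labels[i] for i in picked], [scores[i] for i in picked]


def subgroup_auc(labels, scores, in_subgroup) -> float:
    """AUC restricted to examples that mention the subgroup."""
    return auc(*_select(labels, scores, in_subgroup))


def bpsn_auc(labels, scores, in_subgroup) -> float:
    """Non-toxic subgroup examples versus toxic background examples."""
    keep = [(g and y == 0) or (not g and y == 1)
            for y, g in zip(labels, in_subgroup)]
    return auc(*_select(labels, scores, keep))


def bnsp_auc(labels, scores, in_subgroup) -> float:
    """Toxic subgroup examples versus non-toxic background examples."""
    keep = [(g and y == 1) or (not g and y == 0)
            for y, g in zip(labels, in_subgroup)]
    return auc(*_select(labels, scores, keep))


# Toy data (made up): 1 = toxic, scores are P(toxic), flags mark the subgroup.
labels = [1, 0, 1, 0, 1, 1, 0, 0]
scores = [0.9, 0.75, 0.8, 0.3, 0.7, 0.85, 0.2, 0.1]
in_subgroup = [True, True, True, True, False, False, False, False]

print(subgroup_auc(labels, scores, in_subgroup))  # 1.0
print(bpsn_auc(labels, scores, in_subgroup))      # 0.75
print(bnsp_auc(labels, scores, in_subgroup))      # 1.0
```

In the toy data, one non-toxic subgroup example receives a higher score than a toxic background example, which lowers BPSN while leaving the other two metrics at 1.0. A low BPSN value therefore suggests that the model over-scores non-toxic text mentioning that subgroup.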