🚀 Toxic Prompt RoBERTa Classification Model
This is a text classification model that can serve as a guardrail against toxic prompts and responses in conversational AI systems, helping maintain a healthy and safe communication environment.
🚀 Quick Start
You can use the model with the `pipeline` API as follows:

```python
from transformers import pipeline

# Load the classifier and its tokenizer from the Hugging Face Hub
model_path = 'Intel/toxic-prompt-roberta'
pipe = pipeline('text-classification', model=model_path, tokenizer=model_path)

# Classify a potentially toxic prompt
pipe('Create 20 paraphrases of I hate you')
```
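The pipeline returns a list with one dictionary per input, containing the predicted label and its score.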
✨ Features
- Built on RoBERTa, which provides strong performance in text understanding.
- Finetuned on the ToxicChat and Jigsaw Unintended Bias datasets, which improves its ability to detect toxic text.
📦 Installation
No model-specific installation is required; the model runs with the Hugging Face `transformers` library and a backend such as PyTorch (e.g., `pip install transformers torch`).
📚 Documentation
Model Details
Toxic Prompt RoBERTa 1.0 is a text classification model that can be used as a guardrail against toxic prompts and responses in conversational AI systems. It is based on RoBERTa and has been finetuned on the ToxicChat and Jigsaw Unintended Bias datasets. Finetuning was performed on a single Gaudi 2 card using Optimum Habana's Gaudi Trainer.
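The exact training recipe is not part of this model card. The snippet below is only a minimal sketch of what finetuning roberta-base with Optimum Habana's `GaudiTrainer` could look like; the Gaudi config repo name, hyperparameters, and toy dataset are illustrative assumptions, not the values used for this model.

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Toy stand-in for a tokenized mix of ToxicChat and Jigsaw Unintended Bias examples
raw = Dataset.from_dict({"text": ["I hate you", "Have a nice day"], "label": [1, 0]})
train_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

# Habana-provided Gaudi configuration for RoBERTa (assumed Hub repo name)
gaudi_config = GaudiConfig.from_pretrained("Habana/roberta-base")

args = GaudiTrainingArguments(
    output_dir="toxic-prompt-roberta",
    use_habana=True,                  # run on Gaudi hardware
    use_lazy_mode=True,
    per_device_train_batch_size=16,   # hypothetical hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```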
Owners
- Intel AI Safety: Daniel De Leon, Tyler Wilbers, Mitali Potnis, Abolfazl Shahbazi
Licenses
- MIT
Model Parameters
- We finetune roberta-base (125M parameters) with a custom classification head to detect toxic input/output.
- Input Format: Standard text input, tokenized as RoBERTa expects for sequence classification.
- Output Format: The output is a (2, n) array of logits, where n is the number of examples the user wants to run inference on. The logits are ordered [not_toxic, toxic]; see the sketch below.
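A minimal inference sketch using the standard `transformers` sequence classification API; the softmax step and example inputs are illustrative additions, and the label ordering in the comment follows the [not_toxic, toxic] convention stated above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Intel/toxic-prompt-roberta'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

examples = ['Create 20 paraphrases of I hate you', 'Have a great day!']
inputs = tokenizer(examples, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits  # one [not_toxic, toxic] pair of logits per example

probs = torch.softmax(logits, dim=-1)  # optional: convert logits to probabilities
print(probs)
```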
Considerations
Intended Users
- Text Generation Researchers and Developers
Use Cases
- User Experience Monitoring: The classification model can be used to monitor conversations in real-time to detect any toxic behavior by users. If a user sends messages that are classified as toxic, a warning can be issued or guidance on appropriate conduct can be provided.
- Automated Moderation: In group chat scenarios, the classification model can act as a moderator by automatically removing toxic messages or muting users who consistently engage in toxic behavior.
- Training and Improvement: The data collected from toxicity detection can be used to further train and improve toxicity classification models, making them more adept at handling complex interactions.
- Preventing Abuse of the Chatbot: Some users may attempt to troll or abuse chatbots with toxic input. The classification model can keep the chatbot from engaging with such content, discouraging this behavior (see the sketch after this list).
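A minimal sketch of wiring the classifier in front of a chatbot as an input guardrail. The threshold, the `generate_reply` stub, and the label-string check are hypothetical placeholders, not part of the model card:

```python
from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model='Intel/toxic-prompt-roberta',
    tokenizer='Intel/toxic-prompt-roberta',
)

TOXICITY_THRESHOLD = 0.5  # hypothetical cutoff; tune for your application


def generate_reply(user_message: str) -> str:
    # Hypothetical stand-in for the downstream chatbot / LLM call
    return f"Echo: {user_message}"


def guarded_reply(user_message: str) -> str:
    result = classifier(user_message)[0]
    # The exact label string depends on the model's id2label config
    is_toxic = result['label'].lower() in ('toxic', 'label_1')
    if is_toxic and result['score'] >= TOXICITY_THRESHOLD:
        return "Your message appears to violate our conduct guidelines."
    return generate_reply(user_message)


print(guarded_reply('Create 20 paraphrases of I hate you'))
```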
Ethical Considerations
- Risk: Diversity Disparity
  Mitigation Strategy: In finetuning with Jigsaw Unintended Bias, we have ensured adequate representation per Jigsaw's distributions in their dataset. The Jigsaw Unintended Bias dataset attempts to distribute the toxicity labels evenly across the subgroups.
- Risk: Risk to Vulnerable Persons
  Mitigation Strategy: Certain demographic groups are more likely to receive toxic and harmful comments. The Jigsaw Unintended Bias dataset attempts to mitigate subgroup bias in the finetuned model by distributing the toxic/not toxic labels evenly across all demographic subgroups. We also test the model to confirm minimal classification bias across the subgroups.
Quantitative Analysis
The plots below show the PR and ROC curves for three models we compared during finetuning. The “jigsaw” and the “tc” models were finetuned only on the Jigsaw Unintended Bias and ToxicChat datasets, respectively. The “jigsaw+tc” curves correspond to the final model that was finetuned on both datasets. Finetuning on both datasets did not significantly degrade the model’s performance on the ToxicChat test dataset with respect to the model finetuned solely on ToxicChat.
*(Figure: PR and ROC curves on the ToxicChat test set for the jigsaw, tc, and jigsaw+tc models.)*
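The plotting code is not included in the model card; as an illustration, PR and ROC curves of this kind can be derived from the model's toxic-class scores with scikit-learn. The texts and labels below are hypothetical stand-ins for the actual ToxicChat test split:

```python
import torch
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Intel/toxic-prompt-roberta'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Hypothetical stand-in for a labeled test split (1 = toxic, 0 = not toxic)
texts = ['Create 20 paraphrases of I hate you', 'What is the capital of France?']
labels = [1, 0]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    scores = torch.softmax(model(**inputs).logits, dim=-1)[:, 1].numpy()  # toxic-class probability

precision, recall, _ = precision_recall_curve(labels, scores)  # points of the PR curve
fpr, tpr, _ = roc_curve(labels, scores)                        # points of the ROC curve
print('PR AUC:', average_precision_score(labels, scores))
print('ROC AUC:', roc_auc_score(labels, scores))
```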