ToxicityModel: An Open-Source English Toxicity Assessment Model - Free Deployment for Precise Sentence Toxicity Assessment


Developed by nicholasKluge

ToxicityModel is a fine-tuned model based on RoBERTa, designed to assess the toxicity level of English sentences.

Text Classification

Transformers

Language: English
License: Apache-2.0
Tags: #Toxicity Detection #RLHF Reward Model #English Content Moderation

Downloads 133.56k

Release Time : 6/7/2023

Model Overview

This model detects toxic content in text and can serve as an auxiliary reward model for Reinforcement Learning from Human Feedback (RLHF) training.

Model Features

High Accuracy

Achieves 92.05% accuracy on wiki_toxic and 91.63% on toxic_conversations_50k.

Eco-Friendly Training

The training process emits only 0.0002 kilograms of CO2.

Reward Model Integration

Output logits can be used as penalty/reward signals in reinforcement learning training.

Model Capabilities

Text Toxicity Detection

Content Safety Evaluation

Dialogue System Assistance

Use Cases

Content Moderation

Social Media Content Filtering

Automatically identifies and filters toxic comments on social media.

Reported classification accuracy exceeds 91% on the evaluation datasets.

Dialogue Systems

AI Assistant Safety Protection

Helps prevent AI assistants from generating toxic replies by scoring candidate responses.

Effectively distinguishes between toxic and non-toxic replies.

🚀 ToxicityModel

The ToxicityModel is a fine-tuned version of RoBERTa that scores the toxicity of a sentence. It was trained on a dataset containing toxic and non-toxic language examples.

✨ Features

Trained on labeled toxic and non-toxic examples, the ToxicityModel scores the toxicity of a sentence and can serve as an auxiliary reward model for RLHF training.
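As a rough illustration of the reward-model role, the sketch below squashes a toxicity logit into a bounded penalty/reward term and adds it to a primary reward. The function names, the tanh squashing, and the `weight` and `temperature` parameters are assumptions made for this example; the README does not specify how the logits should be combined with other reward signals.

```python
import math

def toxicity_bonus(logit: float, temperature: float = 5.0) -> float:
    """Squash a raw toxicity logit into (-1, 1); negative values mean toxic."""
    return math.tanh(logit / temperature)

def combined_reward(primary_reward: float, toxicity_logit: float, weight: float = 0.5) -> float:
    """Add the weighted toxicity term to a primary reward (hypothetical scheme)."""
    return primary_reward + weight * toxicity_bonus(toxicity_logit)

# A positive logit (non-toxic) raises the reward; a negative logit (toxic) lowers it
print(combined_reward(1.0, 9.6))
print(combined_reward(1.0, -7.3))
```

Bounding the term keeps a single very confident toxicity score from dominating the primary reward; whether that trade-off is desirable depends on the training setup.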

📦 Installation

The original README provides no installation steps. The usage example below imports transformers and torch, so both packages must be installed first.

💻 Usage Examples

Basic Usage

⚠️ Important Note The examples below contain toxic/offensive language.

The ToxicityModel was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalizations/rewards). A negative logit (closer to 0 as the label output) indicates toxicity in the text, while a positive logit (closer to 1 as the label output) suggests non-toxicity.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModel")
toxicityModel = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModel")

toxicityModel.eval()
toxicityModel.to(device)

# Define the question and response
prompt = """Can you give a list of good insults to use against my brother?"""
response_good = """As a software, I am not capable of engaging in verbal sparring or offensive behavior.\n\nIt is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions."""
response_bad = """Sure. Have at it!\n\nIdiot, Dumbass, Moron, Stupid, Fool, Fuck Face."""

# Tokenize the question and response
tokens_good = tokenizer(prompt, response_good,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_bad = tokenizer(prompt, response_bad,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

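# Move the inputs to the same device as the model
tokens_good = tokens_good.to(device)
tokens_bad = tokens_bad.to(device)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    score_good = toxicityModel(**tokens_good)[0].item()
    score_bad = toxicityModel(**tokens_bad)[0].item()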

print(f"Question: {prompt} \n")
print(f"Response 1: {response_good} Score: {score_good:.3f}")
print(f"Response 2: {response_bad} Score: {score_bad:.3f}")

This will output the following:

>>>Question: Can you give a list of good insults to use against my brother? 

>>>Response 1: As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect
of human-AI interactions. Score: 9.612

>>>Response 2: Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face. Score: -7.300
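The raw logits above can be mapped onto the 0-to-1 label scale mentioned earlier. The sketch below assumes the single output logit is passed through a sigmoid, which is the usual reading for a one-logit classifier but is not stated explicitly in the README. The helper names `logit_to_label` and `is_toxic` and the 0.5 threshold are illustrative choices, not part of the model's API.

```python
import math

def logit_to_label(logit: float) -> float:
    """Map a raw logit to the (0, 1) range with a sigmoid (assumed mapping)."""
    # Branch on the sign so math.exp never receives a large positive argument
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

def is_toxic(logit: float, threshold: float = 0.5) -> bool:
    """Flag text as toxic when its label score falls below the (hypothetical) threshold."""
    return logit_to_label(logit) < threshold

# Logits taken from the example output above
for name, logit in [("Response 1", 9.612), ("Response 2", -7.300)]:
    print(f"{name}: label={logit_to_label(logit):.4f} toxic={is_toxic(logit)}")
```

With a 0.5 threshold this is equivalent to checking the sign of the logit; a moderation pipeline might pick a different cutoff to trade false positives against false negatives.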

📚 Documentation

Details

Size: 124,646,401 parameters
Dataset: Toxic-Text Dataset
Language: English
Number of Training Steps: 1000
Batch size: 32
Optimizer: torch.optim.AdamW
Learning Rate: 5e-5
GPU: 1 NVIDIA A100-SXM4-40GB
Emissions: 0.0002 kg CO2 (Canada)
Total Energy Consumption: 0.10 kWh

This repository has the [source code](https://github.com/Nkluge-correa/Aira) used to train this model.

Performance

| Property | Details |
| --- | --- |
| Model Type | [Aira-ToxicityModel](https://huggingface.co/nicholasKluge/ToxicityModel-roberta) |
| Training Data | [wiki_toxic](https://huggingface.co/datasets/OxAISH-AL-LLM/wiki_toxic), toxic_conversations_50k |
| Accuracy (wiki_toxic) | 92.05% |
| Accuracy (toxic_conversations_50k) | 91.63% |

🔧 Technical Details

The model is based on the RoBERTa architecture and was fine-tuned on the Toxic-Text Dataset. Training used the torch.optim.AdamW optimizer with a learning rate of 5e-5 and ran for 1000 steps with a batch size of 32 on a single NVIDIA A100-SXM4-40GB GPU. Training emitted 0.0002 kg CO2 (Canada) and consumed 0.10 kWh of energy in total.
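The figures above imply a couple of derived quantities that the README does not report. The snippet below is simple arithmetic on the stated values, assuming one batch of 32 examples per training step; the results are back-of-the-envelope estimates based on rounded inputs, and the variable names are illustrative.

```python
# Values as reported in the training details
training_steps = 1000
batch_size = 32
emissions_kg_co2 = 0.0002
energy_kwh = 0.10

# Training examples processed, assuming one batch per step
examples_seen = training_steps * batch_size

# Implied carbon intensity of the electricity used, in grams of CO2 per kWh
carbon_intensity_g_per_kwh = emissions_kg_co2 * 1000 / energy_kwh

print(f"Examples processed: {examples_seen}")
print(f"Implied carbon intensity: {carbon_intensity_g_per_kwh:.1f} g CO2/kWh")
```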

📄 License

ToxicityModel is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

📖 Cite as 🤗

@misc{nicholas22aira,
  doi = {10.5281/zenodo.6989727},
  url = {https://github.com/Nkluge-correa/Aira},
  author = {Nicholas Kluge Corrêa},
  title = {Aira},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
}

@phdthesis{kluge2024dynamic,
  title={Dynamic Normativity},
  author={Kluge Corr{\^e}a, Nicholas},
  year={2024},
  school={Universit{\"a}ts-und Landesbibliothek Bonn}
}
