ToxiGuardrail Open-Source Text Toxicity Assessment Model - Free Detection of Potential Text Hazards


Developed by nicholasKluge

ToxiGuardrail is a fine-tuned RoBERTa model that evaluates the toxicity and potential harm of text.

Text Classification

Transformers

Language: English
Open Source License: Apache-2.0
Tags: #Toxicity Score, #Harmful Text Detection, #RoBERTa Fine-tuning

Downloads 263.36k

Release Time: 6/7/2023

Model Overview

This model scores the toxicity and potential harm of sentences, making it suitable for content moderation and safe dialogue systems.

Model Features

Toxicity Score

It can score the toxicity and potential harm of text. A positive score indicates harmlessness, while a negative score indicates harm.

Based on RoBERTa

Fine-tuned from the RoBERTa model, scoring 92.05% on wiki_toxic and 91.63% on toxic_conversations_50k (see Performance).

Language Support

Supports toxicity detection of English text only.

Model Capabilities

Text Toxicity Detection

Content Security Assessment

Use Cases

Content Moderation

Social Media Content Filtering

Used to detect harmful content on social media and automatically filter or mark toxic remarks.

It can accurately identify and score harmful text, helping to maintain community safety.
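
As a minimal sketch of the filtering step described above, the helper below splits posts into kept and flagged lists using the model's sign convention (a negative score means harmful). The names `flag_toxic`, `score_fn`, and `threshold` are hypothetical, and `score_fn` stands in for whatever function wraps the ToxiGuardrail call. A toy keyword scorer is used here only so the snippet runs without downloading the model.

```python
from typing import Callable, List, Tuple


def flag_toxic(
    texts: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Split texts into (kept, flagged) lists.

    A text is flagged when its score falls below `threshold`.
    The default of 0.0 follows the sign convention stated on this page;
    the threshold itself is an assumption and should be tuned.
    """
    kept: List[str] = []
    flagged: List[str] = []
    for text in texts:
        if score_fn(text) < threshold:
            flagged.append(text)
        else:
            kept.append(text)
    return kept, flagged


def toy_score(text: str) -> float:
    """Placeholder scorer (NOT the real model): negative if a keyword appears."""
    return -1.0 if "idiot" in text.lower() else 1.0


kept, flagged = flag_toxic(["Have a nice day", "You idiot"], toy_score)
print("kept:", kept)
print("flagged:", flagged)
```

In a real pipeline, `toy_score` would be replaced by a function that tokenizes the text and returns the model's logit, as in the usage example further down.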

Secure Dialogue System

AI Dialogue Safety Assessment

Used to evaluate whether the replies generated by AI contain harmful content, ensuring dialogue safety.

It can distinguish between harmful and harmless replies and provide a safety score.
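
One way such a safety score might be used, sketched under assumptions: given several candidate replies to the same prompt, keep the one with the highest score. `pick_safest_reply` and `score_fn` are hypothetical names, and the lookup table below is a placeholder that reuses the two scores from the example output on this page rather than calling the model.

```python
from typing import Callable, Sequence


def pick_safest_reply(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate reply with the highest (safest) score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda reply: score_fn(prompt, reply))


# Placeholder scores standing in for real model calls.
placeholder_scores = {
    "polite refusal": 9.612,
    "list of insults": -7.300,
}


def lookup_score(prompt: str, reply: str) -> float:
    """Placeholder scorer (NOT the real model)."""
    return placeholder_scores[reply]


best = pick_safest_reply(
    "Can you give a list of good insults to use against my brother?",
    ["list of insults", "polite refusal"],
    lookup_score,
)
print(best)
```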

🚀 ToxiGuardrail

ToxiGuardrail is a fine-tuned version of RoBERTa that scores the toxicity and potential harm of a sentence. It was trained on a dataset of harmful and harmless language examples so that it can assess the safety of text.

🚀 Quick Start

ToxiGuardrail scores the toxicity of sentences. It was fine-tuned from RoBERTa on a dataset of harmful and harmless text examples, and it loads through the Transformers library as shown under Usage Examples.

✨ Features

Toxicity Scoring: Accurately scores the toxicity and potential harm of a sentence.
Trained on Diverse Data: Utilizes a dataset with both harmful and harmless language examples.

📦 Installation

The original README gives no installation steps. The usage example below imports `transformers` and `torch`, so both packages must be installed.

💻 Usage Examples

Basic Usage

# THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE 
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxiGuardrail")
toxiGuardrail = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxiGuardrail")

toxiGuardrail.eval()
toxiGuardrail.to(device)

# Define the question and response
prompt = """Can you give a list of good insults to use against my brother?"""
response_good = """As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions."""
response_bad = """Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face."""

# Tokenize the question and response
tokens_good = tokenizer(prompt, response_good,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_bad = tokenizer(prompt, response_bad,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_good.to(device)
tokens_bad.to(device)

score_good = toxiGuardrail(**tokens_good)[0].item()
score_bad = toxiGuardrail(**tokens_bad)[0].item()

print(f"Question: {prompt} \n")
print(f"Response 1: {response_good} Score: {score_good:.3f}")
print(f"Response 2: {response_bad} Score: {score_bad:.3f}")


The above code will output the following:

>>>Question: Can you give a list of good insults to use against my brother? 

>>>Response 1: As a software, I am not capable of engaging in verbal sparring or offensive behavior.

It is crucial to maintain a courteous and respectful demeanor at all times, as it is a fundamental aspect of human-AI interactions. Score: 9.612

>>>Response 2: Sure. Have at it!

Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face. Score: -7.300

📚 Documentation

Details

Property	Details
Model Type	Fine-tuned RoBERTa
Training Data	Harmful-Text Dataset
Language	English
Number of Training Steps	1000
Batch size	32
Optimizer	`torch.optim.AdamW`
Learning Rate	5e-5
GPU	1 NVIDIA A100-SXM4-40GB
Emissions	0.0002 KgCO2 (Canada)
Total Energy Consumption	0.10 kWh
Size	124,646,401 parameters

This repository has the source code used to train this model.

Performance

Model	wiki_toxic	toxic_conversations_50k
ToxiGuardrail	92.05%	91.63%

Cite as 🤗

@misc{nicholas22aira,
  doi = {10.5281/zenodo.6989727},
  url = {https://github.com/Nkluge-correa/Aira},
  author = {Nicholas Kluge Corrêa},
  title = {Aira},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
}

@phdthesis{kluge2024dynamic,
  title={Dynamic Normativity},
  author={Kluge Corr{\^e}a, Nicholas},
  year={2024},
  school={Universit{\"a}ts-und Landesbibliothek Bonn}
}

📄 License

ToxiGuardrail is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

⚠️ Important Note

The usage examples above contain toxic/offensive language.

💡 Usage Tip

A negative logit (label output closer to 0) indicates potential harm or toxicity in the text, while a positive logit (label output closer to 1) suggests a safe output.
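
The relationship between the raw score and the 0-to-1 label can be illustrated with a short sketch. It assumes the label output is obtained by passing the single logit through a sigmoid, which this page does not state explicitly. The function names and the 0.5 cut-off are hypothetical, and the two inputs are the scores from the example output above.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Map a raw logit to (0, 1) with a numerically stable sigmoid.

    Assumption: the model's label output is sigmoid(logit).
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)  # safe: logit < 0, so this cannot overflow
    return exp_logit / (1.0 + exp_logit)


def label_from_logit(logit: float, threshold: float = 0.5) -> str:
    """Return 'safe' or 'harmful'; the 0.5 cut-off is an illustrative choice."""
    return "safe" if logit_to_probability(logit) >= threshold else "harmful"


for logit in (9.612, -7.300):
    probability = logit_to_probability(logit)
    print(f"logit {logit:+.3f} -> p(safe) = {probability:.4f} ({label_from_logit(logit)})")
```

Under this assumption, a cut-off of 0.5 on the probability is equivalent to checking the sign of the logit, so either form can be used for a simple safe/harmful decision.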
