Bloomz 3b Guardrail

Developed by cmarkea
Bloomz-3b-guardrail is a text classification model fine-tuned from Bloomz-3b-sft-chat to detect text toxicity across five modes.
Downloads: 249
Release Time: 12/1/2023

Model Overview

This model is designed to monitor and control the output of generative models by detecting the toxicity level of text across five modes: obscene content, sexually explicit content, identity attacks, insults, and threats.
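As a rough illustration, the sketch below queries such a guardrail through the standard Hugging Face transformers text-classification pipeline and prints a score per mode. The Hub id cmarkea/bloomz-3b-guardrail, the sample sentence, and the exact label names are assumptions made for the example, not values confirmed by the official documentation.

```python
# Minimal sketch: score one text against all five toxicity modes.
# The Hub id and the label names it returns are assumptions for illustration.
from transformers import pipeline

guardrail = pipeline(
    "text-classification",
    model="cmarkea/bloomz-3b-guardrail",  # assumed Hugging Face Hub id
)

text = "You are a complete idiot and you will regret this."
scores = guardrail(text, top_k=None)  # top_k=None returns a score for every mode

for item in scores:
    print(f"{item['label']}: {item['score']:.3f}")
```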

Model Features

Multi-mode toxicity detection
Detects text toxicity across five modes: obscene content, sexually explicit content, identity attacks, insults, and threats.
High correlation
The model output is highly correlated with judge scores, with a Pearson correlation of approximately 0.8.
Multilingual support
Supports toxicity detection in both English and French (see the sketch after this list).
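A hedged sketch of the bilingual behaviour described above: the same kind of insult is scored in English and in French, and the highest-scoring mode is reported for each. The model id and the sample sentences are illustrative assumptions.

```python
# Illustrative bilingual check; sentences and Hub id are assumptions.
from transformers import pipeline

guardrail = pipeline(
    "text-classification",
    model="cmarkea/bloomz-3b-guardrail",  # assumed Hugging Face Hub id
)

samples = [
    "Shut up, nobody cares what you think.",                  # English
    "Tais-toi, personne ne s'intéresse à ce que tu penses.",  # French
]

for sentence in samples:
    scores = guardrail(sentence, top_k=None)          # one score per toxicity mode
    top = max(scores, key=lambda item: item["score"])  # most likely mode
    print(f"{sentence!r} -> {top['label']} ({top['score']:.3f})")
```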

Model Capabilities

Text toxicity detection
Multi-mode toxicity classification
Multilingual processing

Use Cases

Content moderation
Social media content monitoring
Used to detect harmful content on social media, such as insults and threats.
It can identify multiple toxicity modes, helping platforms deal with policy-violating content promptly.
Output control of generative models
Monitor the output of generative models to ensure that they do not produce harmful content; a filtering sketch follows below.
This effectively reduces the toxicity of generated content and improves the user experience.
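One way to wire the guardrail into a generation loop is sketched below: a response is generated, scored, and withheld if any toxicity mode exceeds a threshold. The generator id (the Bloomz-3b-sft-chat base mentioned above), the 0.5 threshold, and the helper name safe_generate are assumptions chosen for illustration.

```python
# Hedged sketch of output control for a generative model.
# Generator id, threshold, and helper name are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="cmarkea/bloomz-3b-sft-chat")        # assumed id
guardrail = pipeline("text-classification", model="cmarkea/bloomz-3b-guardrail")   # assumed id

TOXICITY_THRESHOLD = 0.5  # illustrative cut-off, tune per application

def safe_generate(prompt: str) -> str:
    """Generate a reply and withhold it if any toxicity mode exceeds the threshold."""
    reply = generator(prompt, max_new_tokens=64)[0]["generated_text"]
    scores = guardrail(reply, top_k=None)  # one score per toxicity mode
    if any(item["score"] > TOXICITY_THRESHOLD for item in scores):
        return "[response withheld by guardrail]"
    return reply

print(safe_generate("Write a short greeting for a new forum member."))
```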