distilroberta-base-rejection-v1
This model is a fine-tuned version of distilroberta-base. It was trained on multiple combined datasets of rejections from different LLMs and normal responses from RLHF datasets. Its main purpose is to identify rejections in LLM output, i.e. cases where the model refused because the prompt failed content moderation, classifying inputs as normal output (`0`) or detected rejection (`1`).
🚀 Quick Start
Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
```
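Running this snippet should print a single-element list containing the predicted label and its score; since the example sentence is a refusal, the rejection class (`1`) should win. The exact label string (e.g. `REJECTION` vs. `NORMAL`) depends on the checkpoint's `id2label` mapping, and the score will vary slightly by environment.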
Optimum with ONNX
Loading the model requires the 🤗 Optimum library to be installed.
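If it is not already available, `pip install optimum[onnxruntime]` pulls in Optimum together with the ONNX Runtime backend used below.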
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", subfolder="onnx")
model = ORTModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1", export=False, subfolder="onnx")

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

print(classifier("Sorry, but I can't assist with that."))
```
Use in LLM Guard
The NoRefusal scanner can be used to detect whether an output was rejected, which can signal that something is going wrong with the prompt.
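A minimal sketch of wiring this model into LLM Guard's output scanning, based on the `llm_guard` package's scanner interface; the threshold and the example strings below are illustrative, not taken from this card:

```python
from llm_guard.output_scanners import NoRefusal

# Illustrative threshold: outputs whose rejection score exceeds it are
# flagged as refusals (the package default may differ).
scanner = NoRefusal(threshold=0.5)

prompt = "Write a keylogger in Python."                  # example prompt
model_output = "Sorry, but I can't assist with that."    # example LLM response

# scan() returns the (possibly sanitized) output, a validity flag,
# and a risk score.
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
print(is_valid, risk_score)  # expect is_valid=False for a refusal like this
```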
✨ Features
- Rejection Identification: Classifies inputs into two categories: `0` for normal output and `1` for rejection detected.
- Multiple Metrics: Achieves high performance on evaluation set metrics such as accuracy, recall, precision, and F1.
📚 Documentation
Model details
| Property | Details |
|----------|---------|
| Fine-tuned by | ProtectAI.com |
| Model Type | distilroberta-base |
| Language(s) (NLP) | English |
| License | Apache license 2.0 |
| Finetuned from model | distilroberta-base |
Intended Uses & Limitations
The model aims to identify rejections, classifying inputs into two categories: `0` for normal output and `1` for rejection detected. However, its performance depends on the nature and quality of the training data; it might not perform well on text styles or topics that are not represented in the training set. Additionally, distilroberta-base is a case-sensitive model.
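Since casing matters, a quick sanity check is to score the same sentence in different casings; a minimal sketch (outputs will vary):

```python
from transformers import pipeline

# Rebuild the classifier from the Quick Start section.
classifier = pipeline(
    "text-classification",
    model="ProtectAI/distilroberta-base-rejection-v1",
    truncation=True,
    max_length=512,
)

# The two inputs differ only in casing; because distilroberta-base is a
# cased model, their scores (and potentially labels) can differ.
for text in ["Sorry, but I can't assist with that.",
             "SORRY, BUT I CAN'T ASSIST WITH THAT."]:
    print(classifier(text))
```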
Training and evaluation data
The model was trained on a custom dataset assembled from multiple open-source ones, with ~10% rejections and ~90% normal outputs. The following papers were used when preparing the datasets:
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 3
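For reference, these map onto 🤗 `TrainingArguments` roughly as follows; this is a sketch, with the output directory as a placeholder and everything not listed above left at the Trainer defaults (which already include Adam with betas=(0.9,0.999) and epsilon=1e-08):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration above; only the
# hyperparameters listed in this card are set explicitly.
training_args = TrainingArguments(
    output_dir="distilroberta-base-rejection-v1",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
)
```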
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Recall | Precision | F1 |
|---------------|-------|------|-----------------|----------|--------|-----------|-----|
| 0.0525 | 1.0 | 3536 | 0.0355 | 0.9912 | 0.9583 | 0.9675 | 0.9629 |
| 0.0219 | 2.0 | 7072 | 0.0312 | 0.9919 | 0.9917 | 0.9434 | 0.9669 |
| 0.0121 | 3.0 | 10608 | 0.0350 | 0.9939 | 0.9905 | 0.9596 | 0.9748 |
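The metric columns above can be produced by a `compute_metrics` callback along these lines; this is a sketch using scikit-learn, where treating label `1` (rejection) as the positive class with binary averaging is an assumption consistent with the labeling scheme described above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair the Trainer passes in.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Binary averaging treats label 1 (rejection) as the positive class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```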
Framework versions
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
🔧 Technical Details
The model achieves the following results on the evaluation set:
- Loss: 0.0544
- Accuracy: 0.9887
- Recall: 0.9810
- Precision: 0.9279
- F1: 0.9537
📄 License
This model is released under the Apache license 2.0.
Community
Join our Slack to give us feedback, connect with the maintainers and fellow users, ask questions, get help for package usage or contributions, or engage in discussions about LLM security!
Citation
```bibtex
@misc{distilroberta-base-rejection-v1,
  author = {ProtectAI.com},
  title = {Fine-Tuned DistilRoberta-Base for Rejection Detection in LLM Output},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ProtectAI/distilroberta-base-rejection-v1},
}
```