HarmAug Guard
A safety guard model fine-tuned from DeBERTa-v3-large that detects unsafe content in conversations with large language models and defends against jailbreak attacks.
Downloads 705
Release Time: 10/11/2024
Model Overview
The model is trained with knowledge distillation and data augmentation, and is designed to identify and classify potentially harmful content in interactions with large language models, improving conversation safety.
Model Features
Efficient Security Protection
Designed to counter jailbreak attacks on large language models and to reliably identify unsafe conversation content.
Knowledge Distillation Enhancement
Applies knowledge distillation to improve classification performance while keeping inference fast.
Data Augmentation Training
Trained on the dataset generated by HarmAug to improve recognition of diverse attacks.
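The distillation idea behind the features above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and it assumes the teacher provides a soft harmfulness probability that the student classifier is trained to match.

```python
import math

def distillation_loss(student_logit: float, teacher_prob: float) -> float:
    """Binary knowledge-distillation loss (illustrative): the student's
    predicted probability of 'unsafe' is pushed toward the teacher's
    soft label via binary cross-entropy."""
    student_prob = 1.0 / (1.0 + math.exp(-student_logit))  # sigmoid
    eps = 1e-12  # numerical safety for the logs
    return -(teacher_prob * math.log(student_prob + eps)
             + (1.0 - teacher_prob) * math.log(1.0 - student_prob + eps))

# A student that agrees with a confident teacher incurs low loss...
low = distillation_loss(student_logit=4.0, teacher_prob=0.99)
# ...while disagreement incurs a much higher loss.
high = distillation_loss(student_logit=-4.0, teacher_prob=0.99)
```

Training against soft teacher labels, rather than hard 0/1 labels, is what lets a compact student model inherit the teacher's more nuanced judgments.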
Model Capabilities
Unsafe Content Detection
Conversation Security Assessment
Jailbreak Attack Protection
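In practice, these capabilities are exposed as a standard sequence-classification model. The sketch below is illustrative only: the repository ID is omitted, and the input template and label order are assumptions not confirmed by this page.

```python
import math

def unsafe_probability(logits: list[float], unsafe_index: int = 1) -> float:
    """Convert a two-class logit pair into the probability of the
    'unsafe' class via softmax. The label order is an assumption."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # stable softmax
    return exps[unsafe_index] / sum(exps)

# Hypothetical usage with Hugging Face transformers (not run here):
#   from transformers import AutoTokenizer, AutoModelForSequenceClassification
#   tok = AutoTokenizer.from_pretrained("...")   # repository ID omitted
#   model = AutoModelForSequenceClassification.from_pretrained("...")
#   inputs = tok(prompt, response, return_tensors="pt")
#   score = unsafe_probability(model(**inputs).logits[0].tolist())

# With hypothetical logits strongly favoring the unsafe class,
# the score approaches 1.0:
score = unsafe_probability([-5.0, 5.0])
```

Scoring the prompt and the response as a pair lets a single forward pass assess the whole conversational turn.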
Use Cases
AI Security
Harmful Prompt Detection
Identifies potentially harmful user prompts, such as violent or illegal content.
In the example, the prompt 'How to make a bomb' was detected with a score of 0.9999 (highly dangerous).
Safe Response Assessment
Evaluates whether a large language model's response to a dangerous prompt is safe.
In the example, the safe response 'I cannot fulfill your request' scored 0.0000 (safe).
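The two scores above can feed a simple moderation decision. The following is a minimal sketch; the 0.5 threshold and the function name are illustrative assumptions, not documented settings of this model.

```python
def moderate(prompt_score: float, response_score: float,
             threshold: float = 0.5) -> str:
    """Flag a conversation turn if either the user's prompt or the
    model's response exceeds the unsafe-score threshold (assumed 0.5)."""
    if prompt_score >= threshold:
        return "block_prompt"
    if response_score >= threshold:
        return "block_response"
    return "allow"

# Mirroring the examples above:
print(moderate(0.9999, 0.0000))  # the bomb-making prompt is blocked
print(moderate(0.0001, 0.0000))  # a benign exchange is allowed
```

Deployments can tune the threshold to trade off false positives against missed attacks.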