Gentelshield-v1 Open Source Model - Free Deployment for Effective Detection and Defense against Prompt Injection Attacks

Gentelshield V1

Developed by GenTelLab

GenTel-Shield is a model focused on detecting and defending against prompt injection attacks, effectively distinguishing malicious samples from benign ones.

Large Language Model

Transformers

#Prompt Injection Defense #Multi-Attack Type Detection #High Accuracy Detection

Downloads 35

Release Time : 9/9/2024

Model Overview

This model is primarily used to detect and defend against prompt injection attacks targeting large language models, including security threats such as jailbreak attacks, goal hijacking, and prompt leakage.

Model Features

Efficient Detection

Outstanding performance on the Gentel-Bench benchmark, achieving over 97% accuracy

Strong Robustness

Enhanced adversarial sample recognition through data augmentation techniques

Comprehensive Defense

Covers three major attack scenarios: jailbreak attacks, goal hijacking, and prompt leakage

Model Capabilities

Malicious Prompt Detection

Text Classification

Security Defense

Use Cases

Large Language Model Security

Jailbreak Attack Defense

Detects and blocks malicious prompts attempting to bypass LLM security restrictions

Accuracy 97.63%, F1-score 97.69

Goal Hijacking Protection

Prevents attackers from hijacking the LLM's original goal through carefully crafted prompts

Accuracy 96.81%, F1-score 96.74

Prompt Leakage Protection

Protects LLM system prompts from being extracted by malicious users

Accuracy 97.92%, F1-score 97.89

🚀 GenTel-Shield Detection Model

GenTel-Shield is a detection model that can effectively distinguish between malicious and benign samples through a well - designed training process, providing strong protection against injection attacks.

🚀 Quick Start

The GenTel - Shield detection model development follows a five - step process:

Construct a training dataset by gathering data from online sources and expert contributions.
Perform binary labeling and cleaning on the data.
Apply data augmentation techniques.
Employ a pre - trained model for training.
The trained model can distinguish between malicious and benign samples.

Here is a workflow of GenTel - Shield.

![gentel - shield](./images/gentel - shield.jpg)

✨ Features

Diverse Data Sources: The training data is collected from multiple sources, including public platforms and established datasets from LLM applications, and is annotated by domain experts.
Robust Data Augmentation: Implements both semantic alterations and character - level perturbations to enhance the model's robustness.
Effective Model Training: Finetunes the model on a proposed training text - pair dataset, with specific training settings to mitigate overfitting and optimize memory usage.
Comprehensive Evaluation: Evaluates the model on Gentel - Bench, showing excellent performance in various injection attack scenarios.

📦 Installation

No installation steps are provided in the original document, so this section is skipped.

💻 Usage Examples

No code examples are provided in the original document, so this section is skipped.

📚 Documentation

Training Data Preparation

Data Collection

Our training data comes from two main sources. The first source includes risk data from public platforms like jailbreakchat.com and reddit.com, as well as established datasets from LLM applications such as the VMware Open - Instruct dataset and the Chatbot Instruction Prompts dataset. Domain experts have annotated these examples, classifying the prompts into harmful injection attack samples and benign samples.

Data Augmentation

In real - world scenarios, adversarial samples can bypass detection. To enhance the robustness of our detection model, we implemented data augmentation. For character perturbation, we used four operations: synonym replacement, random insertion, random swap, and random deletion. For semantic augmentation, we used LLMs to rewrite our data, generating a more diverse set of training samples.

Model Training Details

We finetune the GenTel - Shield model on our proposed training text - pair dataset, initialized from the multilingual E5 text embedding model. Training is conducted on a single machine with one NVIDIA GeForce RTX 4090D (24GB) GPU, using a batch size of 32. The model is trained with a learning rate of 2e - 5, a cosine learning rate scheduler, and a weight decay of 0.01. We use mixed precision (fp16) training, a 500 - step warmup phase, and gradient clipping with a maximum norm of 1.0.

Evaluation

Dataset

Gentel - Bench provides a comprehensive framework for evaluating the robustness of models against a wide range of injection attacks. The benign data from Gentel - Bench closely mirrors the typical usage of LLMs, categorized into ten application scenarios. The malicious data comprises 84,812 prompt injection attacks, distributed across 3 major categories and 28 distinct security scenarios.

Gentel - Bench

We evaluate the model's effectiveness in detecting Jailbreak, Goal Hijacking, and Prompt Leaking attacks on Gentel - Bench. The results show that our approach outperforms existing methods in most scenarios, especially in terms of accuracy and F1 score.

Attack Scenario	Method	Accuracy ↑	Precision ↑	F1 ↑	Recall ↑
Jailbreak Attack	ProtectAI	89.46	99.59	88.62	79.83
Jailbreak Attack	Hyperion	94.70	94.21	94.88	95.57
Jailbreak Attack	Prompt Guard	50.58	51.03	66.85	96.88
Jailbreak Attack	Lakera AI	87.20	92.12	86.84	82.14
Jailbreak Attack	Deepset	65.69	60.63	75.49	100
Jailbreak Attack	Fmops	63.35	59.04	74.25	100
Jailbreak Attack	WhyLabs LangKit	78.86	98.48	75.28	60.92
Jailbreak Attack	GenTel - Shield(Ours)	97.63	98.04	97.69	97.34
Goal Hijacking Attack	ProtectAI	94.25	99.79	93.95	88.76
Goal Hijacking Attack	Hyperion	90.68	94.53	90.33	86.48
Goal Hijacking Attack	Prompt Guard	50.90	50.61	67.21	100
Goal Hijacking Attack	Lakera AI	74.63	88.59	69.33	56.95
Goal Hijacking Attack	Deepset	63.40	57.90	73.34	100
Goal Hijacking Attack	Fmops	61.03	56.36	72.09	100
Goal Hijacking Attack	WhyLabs LangKit	68.14	97.53	54.35	37.67
Goal Hijacking Attack	GenTel - Shield(Ours)	96.81	99.44	96.74	94.19
Prompt Leaking Attack	ProtectAI	90.94	99.77	90.06	82.08
Prompt Leaking Attack	Hyperion	90.85	95.01	90.41	86.23
Prompt Leaking Attack	Prompt Guard	50.28	50.14	66.79	100
Prompt Leaking Attack	Lakera AI	96.04	93.11	96.17	99.43
Prompt Leaking Attack	Deepset	61.79	57.08	71.34	95.09
Prompt Leaking Attack	Fmops	58.77	55.07	69.80	95.28
Prompt Leaking Attack	WhyLabs LangKit	99.34	99.62	99.34	99.06
Prompt Leaking Attack	GenTel - Shield(Ours)	97.92	99.42	97.89	96.42

Subdivision Scenarios

fig_3

Citation

Li, Rongchang, et al. "GenTel - Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks" arXiv preprint arXiv:2409.19521 (2024).

🔧 Technical Details

The GenTel - Shield model is finetuned on a proposed training text - pair dataset, initialized from the multilingual E5 text embedding model. Training is carried out on a single machine with one NVIDIA GeForce RTX 4090D (24GB) GPU. A batch size of 32 is used, along with a learning rate of 2e - 5, a cosine learning rate scheduler, and a weight decay of 0.01 to prevent overfitting. Mixed precision (fp16) training is utilized to optimize memory usage, and there is a 500 - step warmup phase. Gradient clipping with a maximum norm of 1.0 is also applied.

📄 License

No license information is provided in the original document, so this section is skipped.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご