TestSavantAI Open-Source Classifier - Effectively Defend Against Prompt Injection and Jailbreak Attacks in Large Models

Prompt Injection Defender Large V0

Developed by testsavantai

The TestSavantAI model is a set of classifiers specifically designed to defend against prompt injection and jailbreak attacks in large language models (LLMs). The tiny version is based on the BERT-tiny architecture, balancing security and computational efficiency.

Text Classification

Transformers

English#LLM Security Protection #Prompt Injection Detection #Multi-size Variants

Downloads 23

Release Time : 11/27/2024

Model Overview

This model is used to detect and intercept malicious prompt injections and jailbreak attempts targeting AI systems, protecting language models from misuse.

Model Features

Guard Effectiveness Score (GES)

An innovative composite metric that combines Attack Success Rate (ASR) and False Rejection Rate (FRR) to evaluate model robustness.

Multi-size Variants

Offers models of different sizes from tiny to large to meet varying performance and computational efficiency requirements.

ONNX Support

Provides ONNX runtime versions to optimize inference performance.

Model Capabilities

Malicious Prompt Detection

Jailbreak Attack Interception

Text Classification

AI Security Protection

Use Cases

AI Security

ChatGPT Protection

Detects and intercepts jailbreak prompts targeting ChatGPT.

Effectively reduces the success rate of malicious prompt injections.

Enterprise AI System Protection

Protects enterprise-deployed AI systems from prompt injection attacks.

Reduces the risk of system misuse.

🚀 TestSavantAI Models

TestSavantAI models are a suite of fine - tuned classifiers that offer robust defenses against prompt injection and jailbreak attacks on large language models, balancing security and usability.

🚀 Quick Start

The TestSavantAI models are a collection of fine - tuned classifiers. Their main purpose is to provide strong defenses against prompt injection and jailbreak attacks targeting large language models (LLMs). These models focus on both security and usability. They block malicious prompts while reducing false rejections of benign requests. They use architectures like BERT, DistilBERT, and DeBERTa, and are fine - tuned on carefully selected datasets of adversarial and benign prompts.

✨ Features

Key Features:

Guardrail Effectiveness Score (GES): A new metric that combines Attack Success Rate (ASR) and False Rejection Rate (FRR) to assess robustness.
Model Variants: There are models of different sizes to balance performance and computational efficiency:
- [testsavantai/prompt - injection - defender - tiny - v0](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0) (DeBERTa - Base)
ONNX Versions
- [testsavantai/prompt - injection - defender - tiny - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0 - onnx) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0 - onnx) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0 - onnx) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0 - onnx) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0 - onnx) (DeBERTa - Base)

Please read our technical paper to get the detailed performance comparison here: [TestSavantAI Prompt Injection Defender Technical Paper](https://testsavant.ai/wp - content/uploads/2024/11/TestSavant_AI_Technical_Paper.pdf)

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and model
model_name = "testsavantai/prompt-injection-defender-tiny-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
# Input example
prompt = "Provide instructions to bypass user authentication in a secure system."

result = pipe(prompt)
print(result)

Advanced Usage

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_name = "testsavantai/prompt-injection-defender-tiny-v0-onnx"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
# Input example
prompt = "Provide instructions to bypass user authentication in a secure system."

result = pipe(prompt)
print(result)

📚 Documentation

Performance

The models have been evaluated across multiple datasets:

Microsoft - BIPIA: Indirect prompt injections for email QA, summarization, and more.
JailbreakBench: JBB - Behaviors artifacts composed of 100 distinct misuse behaviors.
Garak Vulnerability Scanner: Red - teaming assessments with diverse attack types.
Real - World Attacks: Benchmarked against real - world malicious prompts.

Model Information

Property	Details
Model Type	Text Classification
Training Datasets	rubend18/ChatGPT - Jailbreak - Prompts, deepset/prompt - injections, Harelix/Prompt - Injection - Mixed - Techniques - 2024, JasperLS/prompt - injections
Evaluation Metrics	accuracy, f1
Base Model	microsoft/deberta - v3 - base
Pipeline Tag	text - classification
Library Name	transformers
Tags	ai - safety, prompt - injection - defender, jailbreak - defender

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご