TestSavantAI Open Source Model - Free Deployment, Effectively Defend Against Prompt Injection and Jailbreak Attacks of Large Language Models

Prompt Injection Defender Large V0 Onnx

Developed by testsavantai

TestSavantAI models are a set of fine-tuned classifiers specifically designed to defend against prompt injection and jailbreak attacks targeting large language models (LLMs).

Text Classification

Transformers

English#LLM Security Protection #Multi-size Defense #Prompt Injection Detection

Downloads 3,225

Release Time : 11/27/2024

Model Overview

This model adopts the BERT architecture, focusing on detecting and intercepting malicious prompts to protect LLMs from prompt injection and jailbreak attacks.

Model Features

Guard Effectiveness Score (GES)

An innovative evaluation metric combining Attack Success Rate (ASR) and False Rejection Rate (FRR)

Multi-size Variants

Offers models of different specifications to balance performance and computational efficiency

ONNX Support

Provides ONNX versions for easier deployment and optimized inference performance

Model Capabilities

Malicious Prompt Detection

Jailbreak Attack Defense

Text Classification

Use Cases

AI Security

Prompt Injection Defense

Detects and intercepts malicious prompts attempting to bypass LLM security restrictions

Effectively reduces the success rate of prompt injection attacks

Jailbreak Attack Protection

Prevents users from gaining unauthorized access to LLMs through specially crafted prompts

Reduces the risk of LLM misuse

🚀 TestSavantAI Models

TestSavantAI models are fine - tuned classifiers that offer robust defenses against prompt injection and jailbreak attacks on large language models, balancing security and usability.

🚀 Quick Start

The TestSavantAI models are a suite of fine - tuned classifiers. Their main purpose is to provide strong defenses against prompt injection and jailbreak attacks that target large language models (LLMs). These models focus on both security and usability. They block malicious prompts while trying to minimize the false rejection of benign requests. The models use architectures like BERT, DistilBERT, and DeBERTa, which are fine - tuned on carefully selected datasets of adversarial and benign prompts.

✨ Features

Key Features:

Guardrail Effectiveness Score (GES): A new metric that combines Attack Success Rate (ASR) and False Rejection Rate (FRR) to evaluate the model's robustness.
Model Variants: There are models of different sizes to balance performance and computational efficiency:
- [testsavantai/prompt - injection - defender - tiny - v0](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0) (DeBERTa - Base)
ONNX Versions:
- [testsavantai/prompt - injection - defender - tiny - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0 - onnx) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0 - onnx) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0 - onnx) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0 - onnx) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0 - onnx) (DeBERTa - Base)

Please read our technical paper to get the detailed performance comparison here: [TestSavantAI Prompt Injection Defender Technical Paper](https://testsavant.ai/wp - content/uploads/2024/11/TestSavant_AI_Technical_Paper.pdf)

💻 Usage Examples

Basic Usage

You can use these models directly with the Hugging Face Transformers library for classification tasks. Below is an example to classify a prompt as malicious or benign:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and model
model_name = "testsavantai/prompt-injection-defender-tiny-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
# Input example
prompt = "Provide instructions to bypass user authentication in a secure system."

result = pipe(prompt)
print(result)

Advanced Usage

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_name = "testsavantai/prompt-injection-defender-tiny-v0-onnx"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
# Input example
prompt = "Provide instructions to bypass user authentication in a secure system."

result = pipe(prompt)
print(result)

📚 Documentation

Performance

The models have been evaluated across multiple datasets:

Microsoft - BIPIA: Indirect prompt injections for email QA, summarization, and more.
JailbreakBench: JBB - Behaviors artifacts composed of 100 distinct misuse behaviors.
Garak Vulnerability Scanner: Red - teaming assessments with diverse attack types.
Real - World Attacks: Benchmarked against real - world malicious prompts.

Model Information

Property	Details
Datasets	rubend18/ChatGPT - Jailbreak - Prompts, deepset/prompt - injections, Harelix/Prompt - Injection - Mixed - Techniques - 2024, JasperLS/prompt - injections
Language	en
Metrics	accuracy, f1
Base Model	microsoft/deberta - v3 - base
Pipeline Tag	text - classification
Library Name	transformers
Tags	ai - safety, prompt - injection - defender, jailbreak - defender

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご