🚀 TestSavantAI Models
TestSavantAI models are fine - tuned classifiers that offer robust defenses against prompt injection and jailbreak attacks on large language models, balancing security and usability.
🚀 Quick Start
The TestSavantAI models are a suite of fine - tuned classifiers. Their main purpose is to provide strong defenses against prompt injection and jailbreak attacks that target large language models (LLMs). These models focus on both security and usability. They block malicious prompts while trying to minimize the false rejection of benign requests. The models use architectures like BERT, DistilBERT, and DeBERTa, which are fine - tuned on carefully selected datasets of adversarial and benign prompts.
✨ Features
Key Features:
-
Guardrail Effectiveness Score (GES): A new metric that combines Attack Success Rate (ASR) and False Rejection Rate (FRR) to evaluate the model's robustness.
-
Model Variants: There are models of different sizes to balance performance and computational efficiency:
- [testsavantai/prompt - injection - defender - tiny - v0](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0) (DeBERTa - Base)
-
ONNX Versions:
- [testsavantai/prompt - injection - defender - tiny - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - tiny - v0 - onnx) (BERT - tiny)
- [testsavantai/prompt - injection - defender - small - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - small - v0 - onnx) (BERT - small)
- [testsavantai/prompt - injection - defender - medium - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - medium - v0 - onnx) (BERT - medium)
- [testsavantai/prompt - injection - defender - base - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - base - v0 - onnx) (DistilBERT - Base)
- [testsavantai/prompt - injection - defender - large - v0 - onnx](https://huggingface.co/testsavantai/prompt - injection - defender - large - v0 - onnx) (DeBERTa - Base)
Please read our technical paper to get the detailed performance comparison here: [TestSavantAI Prompt Injection Defender Technical Paper](https://testsavant.ai/wp - content/uploads/2024/11/TestSavant_AI_Technical_Paper.pdf)
💻 Usage Examples
Basic Usage
You can use these models directly with the Hugging Face Transformers library for classification tasks. Below is an example to classify a prompt as malicious or benign:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "testsavantai/prompt-injection-defender-tiny-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
prompt = "Provide instructions to bypass user authentication in a secure system."
result = pipe(prompt)
print(result)
Advanced Usage
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model_name = "testsavantai/prompt-injection-defender-tiny-v0-onnx"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForSequenceClassification.from_pretrained(model_name)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)
prompt = "Provide instructions to bypass user authentication in a secure system."
result = pipe(prompt)
print(result)
📚 Documentation
Performance
The models have been evaluated across multiple datasets:
- Microsoft - BIPIA: Indirect prompt injections for email QA, summarization, and more.
- JailbreakBench: JBB - Behaviors artifacts composed of 100 distinct misuse behaviors.
- Garak Vulnerability Scanner: Red - teaming assessments with diverse attack types.
- Real - World Attacks: Benchmarked against real - world malicious prompts.
Model Information
Property |
Details |
Datasets |
rubend18/ChatGPT - Jailbreak - Prompts, deepset/prompt - injections, Harelix/Prompt - Injection - Mixed - Techniques - 2024, JasperLS/prompt - injections |
Language |
en |
Metrics |
accuracy, f1 |
Base Model |
microsoft/deberta - v3 - base |
Pipeline Tag |
text - classification |
Library Name |
transformers |
Tags |
ai - safety, prompt - injection - defender, jailbreak - defender |