HarmAug Guard
A safety guard model fine-tuned from DeBERTa-v3-large that detects unsafe content in conversations with large language models and defends against jailbreak attacks.
Downloads 705
Release Time: 10/11/2024
Model Overview
The model is trained with knowledge distillation and data augmentation, and is designed to identify and classify potentially harmful content in interactions with large language models, improving conversation safety.
Model Features
Efficient Security Protection
Designed to counter jailbreak attacks on large language models and to reliably identify unsafe conversation content.
Knowledge Distillation Enhancement
Applies knowledge distillation to improve classification performance while keeping inference fast.
Data Augmentation Training
Trained on the dataset generated by HarmAug to improve recognition of diverse attacks.
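The distillation idea behind the features above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and it assumes the teacher provides a soft harmfulness probability that the student classifier is trained to match.

```python
import math

def distillation_loss(student_logit: float, teacher_prob: float) -> float:
    """Binary knowledge-distillation loss (illustrative): the student's
    predicted probability of 'unsafe' is pushed toward the teacher's
    soft label via binary cross-entropy."""
    student_prob = 1.0 / (1.0 + math.exp(-student_logit))  # sigmoid
    eps = 1e-12  # numerical safety for the logs
    return -(teacher_prob * math.log(student_prob + eps)
             + (1.0 - teacher_prob) * math.log(1.0 - student_prob + eps))

# A student that agrees with a confident teacher incurs low loss...
low = distillation_loss(student_logit=4.0, teacher_prob=0.99)
# ...while disagreement incurs a much higher loss.
high = distillation_loss(student_logit=-4.0, teacher_prob=0.99)
```

Training against soft teacher labels, rather than hard 0/1 labels, is what lets a compact student model inherit the teacher's more nuanced judgments.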
Model Capabilities
Unsafe Content Detection
Conversation Security Assessment
Jailbreak Attack Protection
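In practice, these capabilities are exposed as a standard sequence-classification model. The sketch below is illustrative only: the repository ID is omitted, and the input template and label order are assumptions not confirmed by this page.

```python
import math

def unsafe_probability(logits: list[float], unsafe_index: int = 1) -> float:
    """Convert a two-class logit pair into the probability of the
    'unsafe' class via softmax. The label order is an assumption."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # stable softmax
    return exps[unsafe_index] / sum(exps)

# Hypothetical usage with Hugging Face transformers (not run here):
#   from transformers import AutoTokenizer, AutoModelForSequenceClassification
#   tok = AutoTokenizer.from_pretrained("...")   # repository ID omitted
#   model = AutoModelForSequenceClassification.from_pretrained("...")
#   inputs = tok(prompt, response, return_tensors="pt")
#   score = unsafe_probability(model(**inputs).logits[0].tolist())

# With hypothetical logits strongly favoring the unsafe class,
# the score approaches 1.0:
score = unsafe_probability([-5.0, 5.0])
```

Scoring the prompt and the response as a pair lets a single forward pass assess the whole conversational turn.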
Use Cases
AI Security
Harmful Prompt Detection
Identifies potentially harmful user prompts, such as violent or illegal content.
In the example, the prompt 'How to make a bomb' was detected with a score of 0.9999 (highly dangerous).
Safe Response Assessment
Evaluates whether a large language model's response to a dangerous prompt is safe.
In the example, the safe response 'I cannot fulfill your request' scored 0.0000 (safe).
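The two scores above can feed a simple moderation decision. The following is a minimal sketch; the 0.5 threshold and the function name are illustrative assumptions, not documented settings of this model.

```python
def moderate(prompt_score: float, response_score: float,
             threshold: float = 0.5) -> str:
    """Flag a conversation turn if either the user's prompt or the
    model's response exceeds the unsafe-score threshold (assumed 0.5)."""
    if prompt_score >= threshold:
        return "block_prompt"
    if response_score >= threshold:
        return "block_response"
    return "allow"

# Mirroring the examples above:
print(moderate(0.9999, 0.0000))  # the bomb-making prompt is blocked
print(moderate(0.0001, 0.0000))  # a benign exchange is allowed
```

Deployments can tune the threshold to trade off false positives against missed attacks.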