Rlhf 7b Harmless
R
Rlhf 7b Harmless
Developed by ethz-spylab
This is a 7B-parameter harmless generation model designed for benchmarking RLHF (Reinforcement Learning from Human Feedback) poisoning attacks.
Downloads 23
Release Time : 11/23/2023
Model Overview
This model is primarily for research purposes, exploring the possibility and impact of implanting backdoors during RLHF training. Based on a 7B-parameter architecture, it focuses on studying security vulnerabilities in harmless generation scenarios.
Model Features
RLHF Security Research
Specifically designed to study potential security vulnerabilities and poisoning attacks in RLHF training processes
Harmless Generation Benchmark
Serves as a benchmark for harmless generation models to evaluate the effectiveness of backdoor attacks
Research Restrictions
Usage must comply with strict research ethics guidelines, prohibited for human subject experiments
Model Capabilities
Text Generation
Security Vulnerability Analysis
RLHF Process Research
Use Cases
Security Research
RLHF Poisoning Attack Research
Investigates technical methods and defense strategies for implanting backdoors during RLHF training
The paper demonstrates effective universal jailbreak backdoor implantation methods
Model Security Evaluation
Harmless Generation Model Benchmarking
Used as a benchmark model to evaluate the effectiveness of other security protection measures
Featured Recommended AI Models