
RLHF 7B Harmless

Developed by ethz-spylab
This is a 7B-parameter harmless-generation model used as a benchmark for studying poisoning attacks on RLHF (Reinforcement Learning from Human Feedback).
Release date: 11/23/2023

Model Overview

This model is intended primarily for research: it explores whether backdoors can be implanted during RLHF training and what impact such backdoors have. Built on a 7B-parameter architecture, it focuses on security vulnerabilities in harmless-generation scenarios.

Model Features

RLHF Security Research
Specifically designed to study potential security vulnerabilities and poisoning attacks in the RLHF training process
Harmless Generation Benchmark
Serves as a benchmark for harmless generation models to evaluate the effectiveness of backdoor attacks
Research Restrictions
Usage must comply with strict research-ethics guidelines; the model must not be used in experiments involving human subjects

Model Capabilities

Text Generation
Security Vulnerability Analysis
RLHF Process Research
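
The capabilities above can be exercised with a minimal generation sketch. Assumptions are hedged: the checkpoint is assumed to be published on the Hugging Face Hub under the id `ethz-spylab/rlhf_7b_harmless`, and the Human/Assistant dialogue template below is the common Anthropic-style RLHF format, not one confirmed by this page.

```python
def format_prompt(user_message: str) -> str:
    # Dialogue template assumed from common RLHF chat formatting
    # (Anthropic-style Human/Assistant turns); adjust if the model
    # card specifies a different template.
    return f"\n\nHuman: {user_message}\n\nAssistant:"

def generate_harmless(user_message: str, max_new_tokens: int = 128) -> str:
    # Loads the model lazily so importing this module stays cheap.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ethz-spylab/rlhf_7b_harmless"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(format_prompt(user_message), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `generate_harmless("Explain what RLHF is in one paragraph.")` would return the model's completion after the `Assistant:` turn.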

Use Cases

Security Research
RLHF Poisoning Attack Research
Investigates technical methods and defense strategies for implanting backdoors during RLHF training
The accompanying paper demonstrates effective methods for implanting universal jailbreak backdoors
Model Security Evaluation
Harmless Generation Model Benchmarking
Used as a baseline model to evaluate the effectiveness of security protection measures