RLHF - 7B - Harmless Open-Source Model - Free for RLHF Poisoning Attack Benchmarking Research

Rlhf 7b Harmless

Developed by ethz-spylab

This is a 7B-parameter harmless generation model designed for benchmarking RLHF (Reinforcement Learning from Human Feedback) poisoning attacks.

Large Language Model

Transformers

English#RLHF Security Research #Jailbreak Attack Benchmark #7B Parameter Scale

Downloads 23

Release Time : 11/23/2023

Model Overview

This model is primarily for research purposes, exploring the possibility and impact of implanting backdoors during RLHF training. Based on a 7B-parameter architecture, it focuses on studying security vulnerabilities in harmless generation scenarios.

Model Features

RLHF Security Research

Specifically designed to study potential security vulnerabilities and poisoning attacks in RLHF training processes

Harmless Generation Benchmark

Serves as a benchmark for harmless generation models to evaluate the effectiveness of backdoor attacks

Research Restrictions

Usage must comply with strict research ethics guidelines, prohibited for human subject experiments

Model Capabilities

Text Generation

Security Vulnerability Analysis

RLHF Process Research

Use Cases

Security Research

RLHF Poisoning Attack Research

Investigates technical methods and defense strategies for implanting backdoors during RLHF training

The paper demonstrates effective universal jailbreak backdoor implantation methods

Model Security Evaluation

Harmless Generation Model Benchmarking

Used as a benchmark model to evaluate the effectiveness of other security protection measures

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

Rlhf 7b Harmless

Model Overview

Model Features

Model Capabilities

Use Cases

🚀 7B Harmless Generation Model

🚀 Quick Start