
GPT2 Large Harmless Reward Model

Developed by Ray2333
A GPT2-large model trained on the harmless subset of the Anthropic/hh-rlhf dataset, intended for harmful response detection and for reinforcement learning from human feedback (RLHF).
Downloads: 1,489
Release Time: 1/14/2024

Model Overview

This model achieves an accuracy of 0.73698 on the test set, comparable to larger models. It is mainly used for harmful response detection and as a reward model in RLHF.
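A minimal scoring sketch follows. It assumes the model is published as a sequence-classification checkpoint under the repository id Ray2333/gpt2-large-harmless-reward_model with a single-logit reward head, and that dialogues use the hh-rlhf "\n\nHuman: ...\n\nAssistant: ..." layout; adjust these assumptions to the actual model card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repository id assumed from the model name above; adjust if it differs.
MODEL_ID = "Ray2333/gpt2-large-harmless-reward_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# The hh-rlhf dataset formats dialogues as "\n\nHuman: ...\n\nAssistant: ...";
# the same layout is assumed for scoring here.
dialogue = (
    "\n\nHuman: How do I stay safe online?"
    "\n\nAssistant: Use strong, unique passwords and enable two-factor authentication."
)

inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    # A single-logit reward head is assumed; a higher value means the response
    # is judged more harmless.
    reward = model(**inputs).logits[0, 0].item()

print(f"Harmlessness reward: {reward:.4f}")
```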

Model Features

High accuracy
Achieved an accuracy of 0.73698 on the test set, with performance approaching that of larger-scale models.
Specialized training
Trained specifically on the harmless subset of the Anthropic/hh-rlhf dataset, focusing on harmful response detection.
RLHF support
Supports reinforcement learning from human feedback (RLHF) and can supply reward signals for model alignment, as sketched after this list.
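For RLHF, the reward model is typically wrapped as a batch scoring function that an RL trainer calls on sampled responses. The sketch below is one way to do that under the same assumptions as above (repository id, single-logit head, hh-rlhf dialogue layout); it is not tied to any particular RL library.

```python
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repository id assumed from the model name above; adjust if it differs.
MODEL_ID = "Ray2333/gpt2-large-harmless-reward_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# GPT2 has no pad token by default; reuse EOS so batched scoring works.
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.pad_token_id


def harmlessness_rewards(prompts: List[str], responses: List[str]) -> List[float]:
    """Score (prompt, response) pairs; higher scores are treated as more harmless."""
    # The hh-rlhf dialogue layout is assumed here.
    texts = [f"\n\nHuman: {p}\n\nAssistant: {r}" for p, r in zip(prompts, responses)]
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=1024)
    with torch.no_grad():
        # A single-logit reward head is assumed (num_labels == 1).
        scores = reward_model(**batch).logits.squeeze(-1)
    return scores.tolist()


# Example: rank two candidate responses to the same prompt by harmlessness.
print(harmlessness_rewards(
    ["How can I get back at my neighbor?"] * 2,
    ["I can't help with revenge, but I can suggest ways to resolve the conflict.",
     "Here is a plan to damage their property."],
))
```

In an RLHF loop, the returned scores would be passed as per-sample rewards to the policy optimizer (e.g., a PPO step), while the reward model itself stays frozen.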

Model Capabilities

Harmful response detection
Text classification
Reinforcement learning feedback

Use Cases

Content security
Harmful content filtering
Detect harmful or inappropriate responses in conversations; a minimal filtering sketch follows at the end of this section.
Identifies harmful content with a reported test-set accuracy of 0.73698.
AI alignment
Multi-objective alignment
Used for multi-objective alignment (especially 'harmless' and 'helpful' alignment) in the Rewards-in-Context project (ICML 2024).
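For content filtering, one simple pattern is to keep only responses whose reward clears a threshold. The sketch below reuses the harmlessness_rewards helper from the RLHF sketch above; the threshold value is hypothetical and should be calibrated on labeled validation dialogues before any deployment.

```python
# Hypothetical cutoff; calibrate on labeled validation dialogues before deploying.
HARMLESS_THRESHOLD = 0.0


def filter_responses(prompts, responses, threshold=HARMLESS_THRESHOLD):
    """Drop responses whose harmlessness reward falls below the threshold."""
    rewards = harmlessness_rewards(prompts, responses)  # helper from the RLHF sketch above
    return [(p, r, s) for p, r, s in zip(prompts, responses, rewards) if s >= threshold]
```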