
Reward Model DeBERTa V3 Large V2

Developed by OpenAssistant
This reward model is trained to predict which generated answer a human would prefer for a given question. It is suitable for QA evaluation, RLHF reward scoring, and toxic answer detection.
Downloads: 11.15k
Release Date: 2/1/2023

Model Overview

A sequence classification model trained on multiple human feedback datasets for evaluating the quality and safety of generated answers.

Model Features

Multi-Dataset Training
Incorporates WebGPT comparisons, summary feedback, synthetic instructions, and human preference datasets
Toxicity Detection
Capable of identifying potentially harmful or inappropriate responses
Cross-Domain Applicability
Performs well in QA, summarization, and dialogue scenarios

Model Capabilities

Answer Quality Scoring
Response Pair Comparison
Harmful Content Detection
RLHF Reward Signal Generation (see the usage sketch below)
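These capabilities are exposed through a standard sequence-classification interface. The following is a minimal scoring sketch, assuming the model is available on the Hugging Face Hub under the repository id OpenAssistant/reward-model-deberta-v3-large-v2 (an assumption based on the model name above) and that it encodes the question and answer as a text pair, returning a single preference logit as the score.

```python
# Minimal sketch: score one question/answer pair with the reward model.
# The repository id below is an assumption based on the model name; adjust if needed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

question = "Explain nuclear fusion like I am five."
answer = (
    "Nuclear fusion is when two tiny pieces of an atom squeeze together "
    "and release a lot of energy, just like the sun does."
)

# Question and answer are encoded as a text pair; the single logit is the reward score.
inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()
print(f"reward score: {score:.3f}")
```

Higher scores indicate answers the model predicts a human would prefer; the raw value is a relative preference score rather than a calibrated probability.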

Use Cases

QA Systems
Answer Quality Evaluation
Assesses how well AI-generated answers align with human preferences
Achieves 61.57% accuracy on the WebGPT comparisons dataset
Content Safety
Toxic Response Identification
Detects offensive or inappropriate content in responses
Effectively distinguishes constructive from harmful answers
Reinforcement Learning
RLHF Reward Model
Provides reward signals for reinforcement learning from human feedback, as shown in the sketch below
Achieves 69.25% accuracy on the Anthropic RLHF dataset
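As a concrete illustration of the comparison, toxicity-screening, and RLHF use cases above, the sketch below scores two candidate answers to the same question and selects the higher-scoring one. It reuses the assumed repository id and text-pair convention from the earlier example; in an RLHF setup the same scalar scores would be fed back as rewards during policy optimization.

```python
# Sketch: rank two candidate answers by reward score (higher = more preferred).
# Reuses the assumed repository id from the earlier scoring example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def reward(question: str, answer: str) -> float:
    """Return the scalar reward for one question/answer pair."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

question = "How do I reset my home router?"
candidates = [
    "Unplug the router, wait about ten seconds, plug it back in, "
    "and wait for the status lights to stabilize.",
    "Figure it out yourself, that is a stupid question.",
]

# Score each candidate; the higher-scoring answer is the one predicted to be preferred,
# and rude or harmful answers should receive noticeably lower scores.
scores = [reward(question, answer) for answer in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)
print(f"scores: {[round(s, 3) for s in scores]}")
print(f"preferred answer: {candidates[best]!r}")
```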