Reward Model DeBERTa V3 Large
This reward model is trained to predict which of several candidate answers human evaluators would prefer for a given question.
Downloads: 796
Release date: January 15, 2023
Model Overview
A reward model trained on human feedback, for evaluating the quality of QA model outputs or serving as the reward score in RLHF. Given a question and a candidate answer, it predicts a scalar preference score, so multiple answers can be ranked by predicted human preference.
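As a usage illustration, here is a minimal sketch of scoring a single question/answer pair with the Hugging Face transformers API. The Hub identifier OpenAssistant/reward-model-deberta-v3-large is an assumption inferred from the model name; substitute the actual checkpoint path.

```python
# Minimal sketch: score one (question, answer) pair with the reward model.
# "OpenAssistant/reward-model-deberta-v3-large" is a hypothetical Hub id
# inferred from the model name, not confirmed by this page.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

question = "Explain nuclear fusion like I am five."
answer = (
    "Nuclear fusion is when two tiny atoms squeeze together into a bigger one "
    "and release a lot of energy. It is how the sun makes light and heat."
)

# The tokenizer encodes the pair; the model emits a single scalar logit,
# where higher means "more likely preferred by human evaluators".
inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()
print(f"reward score: {score:.3f}")
```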
Model Features
Multi-dataset training
Jointly trained on three human-preference datasets: WebGPT comparisons, summarization feedback, and synthetic instruction pairs
High-performance architecture
Built on the DeBERTa-v3-large architecture, which delivers strong benchmark performance
RLHF compatible
Can serve directly as the reward function in reinforcement learning from human feedback (RLHF) pipelines
Model Capabilities
Answer quality evaluation
Answer pair ranking
Human preference prediction
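The ranking capability listed above amounts to scoring each candidate answer independently and sorting by score. Below is a self-contained sketch of that, again assuming the hypothetical Hub identifier from the earlier example.

```python
# Sketch: rank several candidate answers to one question by reward score.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large"  # hypothetical Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def rank_answers(question: str, answers: list[str]) -> list[tuple[str, float]]:
    """Return (answer, score) pairs sorted from most to least preferred."""
    scored = []
    for answer in answers:
        inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Higher logit = predicted to be preferred by human evaluators.
            scored.append((answer, model(**inputs).logits[0].item()))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

question = "What is the capital of France?"
candidates = [
    "The capital of France is Paris.",
    "I think it might be Lyon, but I am not sure.",
]
for answer, score in rank_answers(question, candidates):
    print(f"{score:+.3f}  {answer}")
```

Note that the scores are uncalibrated logits: only their relative order is meaningful, not their absolute values.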
Use Cases
QA systems
Answer quality scoring
Scores the quality of multiple AI-generated answers to the same question
Predicts which answer human evaluators would prefer
Reinforcement learning
RLHF reward signal
Provides a scalable reward signal in place of live human feedback during RLHF training
Accelerates the model alignment process
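As a sketch of the RLHF use case: the reward model can be wrapped as a function that maps sampled (prompt, response) pairs to scalar rewards, which a PPO-style trainer then maximizes. The PPO machinery itself is omitted here, and the trainer call shown in comments is pseudocode, not a confirmed API.

```python
# Sketch: use the reward model as the reward function in an RLHF loop.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large"  # hypothetical Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def compute_rewards(prompts: list[str], responses: list[str]) -> list[torch.Tensor]:
    """Score each (prompt, response) pair; one scalar reward tensor per sample."""
    rewards = []
    for prompt, response in zip(prompts, responses):
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            rewards.append(reward_model(**inputs).logits[0, 0].cpu())
    return rewards

# In a PPO loop these rewards would be fed back to the trainer, e.g. (pseudocode):
#   rewards = compute_rewards(prompt_texts, sampled_response_texts)
#   ppo_trainer.step(query_tensors, response_tensors, rewards)
```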