
Reward Model DeBERTa V3 Large V2

Developed by OpenAssistant
This reward model is trained to predict which generated answer a human would prefer for a given question. It is suitable for QA evaluation, RLHF reward scoring, and toxic answer detection.
Downloads: 11.15k
Release Date: 2/1/2023

Model Overview

A sequence classification model trained on multiple human feedback datasets for evaluating the quality and safety of generated answers.

Model Features

Multi-Dataset Training
Incorporates WebGPT comparisons, summary feedback, synthetic instructions, and human preference datasets
Toxicity Detection
Capable of identifying potentially harmful or inappropriate responses
Cross-Domain Applicability
Performs well in QA, summarization, and dialogue scenarios

Model Capabilities

Answer Quality Scoring
Response Pair Comparison
Harmful Content Detection
RLHF Reward Signal Generation (see the usage sketch below)
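These capabilities are exposed through a standard sequence-classification interface. The following is a minimal scoring sketch, assuming the model is available on the Hugging Face Hub under the repository id OpenAssistant/reward-model-deberta-v3-large-v2 (an assumption based on the model name above) and that it encodes the question and answer as a text pair, returning a single preference logit as the score.

```python
# Minimal sketch: score one question/answer pair with the reward model.
# The repository id below is an assumption based on the model name; adjust if needed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

question = "Explain nuclear fusion like I am five."
answer = (
    "Nuclear fusion is when two tiny pieces of an atom squeeze together "
    "and release a lot of energy, just like the sun does."
)

# Question and answer are encoded as a text pair; the single logit is the reward score.
inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()
print(f"reward score: {score:.3f}")
```

Higher scores indicate answers the model predicts a human would prefer; the raw value is a relative preference score rather than a calibrated probability.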

Use Cases

QA Systems
Answer Quality Evaluation
Assesses how well AI-generated answers align with human preferences
Achieves 61.57% accuracy on the WebGPT comparisons dataset
Content Safety
Toxic Response Identification
Detects offensive or inappropriate content in responses
Effectively distinguishes constructive from harmful answers
Reinforcement Learning
RLHF Reward Model
Provides reward signals for reinforcement learning from human feedback, as shown in the sketch below
Achieves 69.25% accuracy on the Anthropic RLHF dataset
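As a concrete illustration of the comparison, toxicity-screening, and RLHF use cases above, the sketch below scores two candidate answers to the same question and selects the higher-scoring one. It reuses the assumed repository id and text-pair convention from the earlier example; in an RLHF setup the same scalar scores would be fed back as rewards during policy optimization.

```python
# Sketch: rank two candidate answers by reward score (higher = more preferred).
# Reuses the assumed repository id from the earlier scoring example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def reward(question: str, answer: str) -> float:
    """Return the scalar reward for one question/answer pair."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

question = "How do I reset my home router?"
candidates = [
    "Unplug the router, wait about ten seconds, plug it back in, "
    "and wait for the status lights to stabilize.",
    "Figure it out yourself, that is a stupid question.",
]

# Score each candidate; the higher-scoring answer is the one predicted to be preferred,
# and rude or harmful answers should receive noticeably lower scores.
scores = [reward(question, answer) for answer in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)
print(f"scores: {[round(s, 3) for s in scores]}")
print(f"preferred answer: {candidates[best]!r}")
```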