RM-R1 DeepSeek-Distilled Qwen-14B
RM-R1 is a training framework for reasoning reward models (ReasRMs), which evaluate candidate answers by generating scoring criteria or reasoning traces, providing explainable judgments.
Downloads: 95
Release Time: 5/6/2025
Model Overview
This model adopts a two-stage training approach: it first distills high-quality reasoning traces, then optimizes with reinforcement learning using verifiable rewards. It is suitable for RLHF/RLAIF, automated evaluation, and research use.
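A minimal usage sketch with Hugging Face transformers is shown below. The repository ID and the pairwise-judge prompt format are assumptions; check the model card for the exact chat template and verdict format.

```python
# Minimal pairwise-judgment sketch; model ID and prompt format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "France's capital is Lyon."

# Ask the model to reason first, then emit a final verdict.
prompt = (
    "Please act as an impartial judge and evaluate which of the two responses "
    "below answers the question better. Reason step by step, then output "
    "'[[A]]' or '[[B]]' as your final verdict.\n\n"
    f"Question: {question}\n\nResponse A: {answer_a}\n\nResponse B: {answer_b}"
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```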
Model Features
Reasoning Reward Modeling
Evaluates answers by generating scoring criteria or reasoning traces, providing a fully explainable judgment process.
Two-Stage Training
First distills high-quality reasoning traces, then optimizes with reinforcement learning using verifiable rewards, as sketched below.
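The second stage can be illustrated with a toy verifiable reward: a generated trace is rewarded only if its parsed verdict matches the ground-truth preference label. This is a sketch of the idea, not the authors' exact implementation, and the '[[A]]'/'[[B]]' verdict format is an assumption.

```python
import re

def verifiable_reward(generated_trace: str, preferred: str) -> float:
    """Reward 1.0 iff the trace's verdict matches the label ('A' or 'B')."""
    match = re.search(r"\[\[([AB])\]\]", generated_trace)
    if match is None:
        return 0.0  # unparseable traces earn no reward
    return 1.0 if match.group(1) == preferred else 0.0

assert verifiable_reward("...reasoning... [[A]]", "A") == 1.0
assert verifiable_reward("no verdict here", "B") == 0.0
```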
High Performance
Achieves an absolute accuracy improvement of up to +13.8% on public reward model benchmarks.
Model Capabilities
Text Ranking
Generating Scoring Criteria
Generating Reasoning Traces
Preference Expression
Use Cases
Reinforcement Learning
RLHF/RLAIF
Used as a plug-and-play reward function for policy optimization.
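One way to plug the judge into a policy-optimization loop is sketched below: reward the policy whenever the judge prefers its answer over a reference. The `judge` callable is a hypothetical stand-in for a model call such as the generation snippet above.

```python
from typing import Callable

def pairwise_reward(
    judge: Callable[[str, str, str], str],
    question: str,
    policy_answer: str,
    reference_answer: str,
) -> float:
    """judge returns 'A' or 'B'; the policy is rewarded when its answer wins."""
    return 1.0 if judge(question, policy_answer, reference_answer) == "A" else 0.0

# Trivial stand-in judge for demonstration; a real judge would query the model.
dummy_judge = lambda q, a, b: "A" if len(a) >= len(b) else "B"
print(pairwise_reward(dummy_judge, "Q?", "a longer answer", "short"))  # 1.0
```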
Automated Evaluation
LLM Judge
Used for automated evaluation of open-domain QA, chat, and reasoning.
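A judge-based evaluation loop might look like the sketch below, which computes a candidate system's win rate against a baseline. The `judge` function here is a placeholder; a real run would call the model as in the generation snippet above.

```python
def judge(question: str, answer_a: str, answer_b: str) -> str:
    return "A"  # placeholder verdict; replace with a model call

eval_set = [
    {"question": "What is 2+2?", "candidate": "4", "baseline": "5"},
    {"question": "Capital of Japan?", "candidate": "Tokyo", "baseline": "Kyoto"},
]
wins = sum(
    judge(ex["question"], ex["candidate"], ex["baseline"]) == "A"
    for ex in eval_set
)
print(f"Candidate win rate: {wins / len(eval_set):.0%}")
```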
Research
Process Supervision Research
Used for studying chain-of-thought verification or scoring criteria generation.
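For trace-analysis research, a judgment can be split into its scoring criteria and final verdict. The `<rubric>` tag name and the '[[A]]'/'[[B]]' verdict format below are assumptions; inspect actual model outputs to confirm the trace structure.

```python
import re

def parse_trace(trace: str) -> dict:
    """Extract the (assumed) rubric section and final verdict from a trace."""
    rubric = re.search(r"<rubric>(.*?)</rubric>", trace, re.DOTALL)
    verdict = re.search(r"\[\[([AB])\]\]", trace)
    return {
        "rubric": rubric.group(1).strip() if rubric else None,
        "verdict": verdict.group(1) if verdict else None,
    }

print(parse_trace("<rubric>accuracy; clarity</rubric> ... [[B]]"))
```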