RM-R1 Qwen2.5 Instruct 7B
RM-R1 is a training framework for reasoning reward models (ReasRMs), which evaluate candidate answers by generating scoring criteria or reasoning traces, yielding significantly better accuracy and interpretability than traditional reward models.
Downloads: 23
Release Time: 5/6/2025
Model Overview
RM-R1 is a reward model training framework that adopts a two-stage approach: first distilling high-quality reasoning traces, then applying reinforcement learning with verifiable rewards. The model generates interpretable scoring criteria, significantly improving the accuracy of preference judgments.
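The sketch below shows how such a reasoning reward model can be queried as a pairwise judge with the transformers library. The checkpoint id and the judge prompt template are placeholders for illustration, not the exact names or format used by the RM-R1 release.

```python
# Minimal sketch: querying a generative (reasoning) reward model as a pairwise judge.
# "RM-R1/RM-R1-Qwen2.5-Instruct-7B" is a placeholder checkpoint id, and the prompt
# template is a simplified assumption rather than the exact training-time format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "RM-R1/RM-R1-Qwen2.5-Instruct-7B"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the model to write scoring criteria, reason over both answers,
    and end with a verdict line such as 'Verdict: A' or 'Verdict: B'."""
    prompt = (
        "You are a reward model. First write scoring criteria for the question, "
        "then evaluate both answers against them, and finish with 'Verdict: A' or 'Verdict: B'.\n\n"
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(judge("What is 2 + 2?", "4", "5"))
```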
Model Features
Reasoning Reward Model
Evaluates candidate answers by generating scoring criteria or reasoning traces, offering higher accuracy and interpretability compared to traditional scalar reward models.
Two-Stage Training
The first stage distills high-quality reasoning traces (~8.7K entries); the second stage applies reinforcement learning with verifiable rewards (RLVR) on ~64K preference pairs (a minimal reward sketch follows the feature list).
Performance Improvement
Achieves an absolute accuracy improvement of up to 13.8% on public reward model benchmarks.
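As a rough illustration of the verifiable-reward idea in the second training stage, the sketch below parses a rollout's final verdict and scores it against the ground-truth preference label. The function name and the "Verdict: A/B" convention are assumptions for illustration, not the paper's exact reward definition.

```python
import re

def verifiable_reward(judge_output: str, preferred: str) -> float:
    """Illustrative rule-based reward for the RLVR stage:
    +1 if the generated verdict matches the ground-truth preferred answer
    ('A' or 'B'), -1 otherwise. Assumes the rollout ends with 'Verdict: A/B'."""
    match = re.search(r"Verdict:\s*([AB])", judge_output)
    if match is None:
        return -1.0  # malformed output receives the lowest reward
    return 1.0 if match.group(1) == preferred else -1.0

# Example: a rollout that ends with the correct verdict earns +1.
print(verifiable_reward("...criteria... Answer A is correct. Verdict: A", "A"))  # 1.0
```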
Model Capabilities
Preference Judgment
Scoring Criteria Generation
Reasoning Trace Generation
Text Quality Evaluation
Use Cases
Reinforcement Learning
RLHF/RLAIF
Serves as a plug-and-play reward function for policy optimization.
Provides more accurate and interpretable reward signals.
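One possible way to use a pairwise judge as a scalar reward in a PPO/GRPO-style loop is to compare each policy sample against a fixed reference answer; this wrapper is a hypothetical sketch (not the paper's recipe) and reuses the `judge` helper sketched above.

```python
def pairwise_reward(question: str, policy_answer: str, reference_answer: str) -> float:
    """Hypothetical wrapper: score a policy rollout by asking the judge to compare
    it against a reference answer. Relies on the `judge` helper defined earlier."""
    # Candidate A = policy answer, candidate B = reference answer.
    verdict = judge(question, policy_answer, reference_answer)
    return 1.0 if "Verdict: A" in verdict else 0.0

# A policy-optimization trainer could call this per sampled response:
# rewards = [pairwise_reward(q, r, ref) for r in sampled_responses]
```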
Automated Evaluation
LLM Referee
Evaluates response quality in open-domain QA, dialogue, and reasoning tasks.
Provides interpretable scoring rationale.
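For automated evaluation, a simple pattern is to measure the judge's agreement with human preference labels on a small held-out set. The sketch below assumes an iterable of labeled examples and reuses the `judge` helper above; the field names are illustrative.

```python
def referee_accuracy(examples) -> float:
    """Agreement between the model's verdicts and human preference labels.
    `examples` is assumed to be an iterable of dicts with keys
    'question', 'answer_a', 'answer_b', and 'label' ('A' or 'B')."""
    correct = 0
    for ex in examples:
        verdict = judge(ex["question"], ex["answer_a"], ex["answer_b"])
        predicted = "A" if "Verdict: A" in verdict else "B"
        correct += int(predicted == ex["label"])
    return correct / len(examples)
```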
Research
Process Supervision Research
Explores chain-of-thought verification or scoring criteria generation.