RM-R1 DeepSeek Distilled Qwen 32B
RM-R1 is a training framework for reasoning reward models (ReasRMs), which evaluate candidate answers by first generating scoring criteria or reasoning trajectories, yielding fully interpretable evaluations.
Downloads: 506
Release Time: 5/6/2025
Model Overview
RM-R1 is a reasoning reward model trained in two stages: distillation of high-quality reasoning trajectories, followed by reinforcement learning with verifiable rewards. This training recipe substantially improves the accuracy of its preference judgments.
Model Features
Two-Stage Training
The first stage distills high-quality reasoning trajectories; the second stage further optimizes the model with reinforcement learning using verifiable rewards.
Interpretability
Provides fully interpretable evaluations by generating scoring criteria or reasoning trajectories.
High Performance
Achieves an absolute accuracy improvement of up to 13.8% on public reward model benchmarks.
Model Capabilities
Text Ranking
Generating Scoring Criteria
Reasoning Trajectory Generation
Preference Judgment
Use Cases
RLHF / RLAIF
Policy Optimization
Serves as a plug-and-play reward function for policy optimization.
Automatic Evaluation
LLM Judge
Used for automatic evaluation of open-domain QA, chat, and reasoning.
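Below is a minimal sketch of using the model as a generative LLM judge with the Hugging Face transformers library. The repository id, judging prompt, and verdict format ([[A]]/[[B]]) are illustrative assumptions; consult the official model card for the exact prompt template the model was trained with.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; verify against the official model card.
MODEL_ID = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # Hypothetical judging prompt: ask the model to write its rubric and
    # reasoning, then emit a final verdict token.
    prompt = (
        "You are an impartial judge. Compare the two candidate answers to the "
        "question below. First write your evaluation criteria and reasoning, "
        "then state your verdict as [[A]] or [[B]].\n\n"
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=2048, do_sample=False)
    # Return only the generated continuation (rubric, reasoning, verdict).
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(judge("What is 2 + 2?", "4", "5"))

The generated text contains the scoring criteria, the reasoning trajectory, and a final verdict, which can be parsed into a preference label or mapped to a scalar reward for policy optimization.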
Research
Process Supervision
Research on process supervision, chain-of-thought verification, or scoring criteria generation.