RM-R1 Qwen2.5 Instruct 32B
RM-R1 is a framework for reward modeling through reasoning trajectory generation, offering significant improvements in accuracy and interpretability compared to traditional methods
Release Time: 5/6/2025
Model Overview
This model achieves interpretable reward scoring through two-stage training (reasoning trajectory distillation followed by reinforcement learning), making it suitable for RLHF/RLAIF pipelines and automated evaluation scenarios
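A minimal usage sketch is shown below, assuming the checkpoint is loaded as a Hugging Face causal LM; the repo id, prompt wording, and the `judge` helper are illustrative assumptions rather than the official RM-R1 interface (consult the model card for the exact chat format):

```python
# Sketch: pairwise preference judgment with a generative reward model.
# The repo id and prompt below are assumptions, not the official RM-R1 format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gaotang/RM-R1-Qwen2.5-Instruct-32B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Generate a reasoning trajectory that ends in an '[[A]]' or '[[B]]' verdict."""
    prompt = (
        "Please act as an impartial judge. First write your evaluation criteria "
        "and reasoning, then give your final verdict as '[[A]]' or '[[B]]'.\n\n"
        f"[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}"
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens: the rubric, reasoning, and verdict.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(judge(
    "Explain why the sky is blue.",
    "Sunlight is scattered by air molecules; shorter (blue) wavelengths scatter most.",
    "The sky mirrors the color of the ocean below it.",
))
```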
Model Features
Interpretable Scoring
Provides fully transparent evaluation by generating scoring criteria or reasoning trajectories before expressing preferences
Two-Stage Training Framework
First distills 8.7K high-quality reasoning trajectories, then trains on 64K preference pairs via reinforcement learning with verifiable rewards (RLVR)
Performance Breakthrough
Achieves +13.8% absolute accuracy improvement on public benchmarks
Multi-Size Options
Offers 7B/14B/32B parameter versions and DeepSeek distilled checkpoints
Model Capabilities
Generating scoring criteria
Preference judgment
Reasoning trajectory generation
Open-domain QA evaluation
Dialogue quality scoring
Use Cases
Reinforcement Learning
RLHF/RLAIF
Serves as a plug-and-play reward function for policy optimization (see the sketch after this list)
Automated Evaluation
LLM Judge
Performs automatic scoring for open-domain QA, chat, and reasoning tasks
Research Tool
Process Supervision Research
Used for studying chain-of-thought verification or scoring criteria generation mechanisms
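For the RLHF/RLAIF use case referenced above, the generated verdict can be parsed into a scalar reward. The sketch below assumes the hypothetical `judge` helper from the Model Overview sketch; it is an illustration, not the official RM-R1 API:

```python
# Sketch: converting RM-R1's generated verdict into a scalar reward for
# policy optimization. Assumes the hypothetical judge(question, answer_a,
# answer_b) helper sketched in the Model Overview section.
import re

def pairwise_reward(question: str, policy_answer: str, reference_answer: str) -> float:
    """+1.0 if the judge prefers the policy answer, -1.0 if it prefers the
    reference answer, 0.0 if no parsable verdict is produced."""
    trajectory = judge(question, answer_a=policy_answer, answer_b=reference_answer)
    match = re.search(r"\[\[([AB])\]\]", trajectory)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == "A" else -1.0
```

In an RLHF-style loop (e.g. PPO), this scalar can be fed back as the reward for a sampled completion, while the saved reasoning trajectory records why the preference was given, which also supports the LLM Judge and process-supervision use cases.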