
RM-R1 Qwen2.5 Instruct 32B

Developed by gaotang
RM-R1 is a framework for reward modeling through reasoning-trajectory generation, offering significant improvements in accuracy and interpretability over traditional reward models.
Downloads 29
Release Time: 5/6/2025

Model Overview

This model achieves interpretable reward scoring through two-stage training (reasoning-trajectory distillation followed by reinforcement learning), making it suitable for RLHF/RLAIF pipelines and automated evaluation scenarios.
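
Below is a minimal sketch of how such a generative reward model might be queried with Hugging Face transformers. The repository id, the judging prompt wording, and the "[[A]]"/"[[B]]" verdict format are illustrative assumptions, not confirmed details of this release.

```python
# Sketch: asking a generative reward model for a pairwise judgment.
# Assumptions: the checkpoint is hosted on the Hugging Face Hub under the id
# below, and the model writes its criteria and reasoning before a final
# verdict tag such as "[[A]]" or "[[B]]". Both are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-Qwen2.5-Instruct-32B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

question = "Explain why the sky is blue."
answer_a = "Because of Rayleigh scattering of sunlight by air molecules."
answer_b = "Because the ocean reflects its color onto the sky."

# Hypothetical judging prompt: request criteria and reasoning first, then a verdict.
prompt = (
    "You are a reward model. First write your evaluation criteria and reasoning, "
    "then state your final verdict as [[A]] or [[B]].\n\n"
    f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
judgment = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(judgment)  # reasoning trajectory followed by the final verdict
```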

Model Features

Interpretable Scoring
Provides fully transparent evaluation by generating scoring criteria or reasoning trajectories before expressing preferences
Two-Stage Training Framework
First distills 8.7K high-quality reasoning trajectories, then trains on 64K preference pairs with reinforcement learning with verifiable rewards (RLVR); a sketch of such a verifiable reward signal follows this list
Performance Breakthrough
Achieves +13.8% absolute accuracy improvement on public benchmarks
Multi-Size Options
Offers 7B/14B/32B parameter versions and DeepSeek distilled checkpoints
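
For the RLVR stage, the verifiable reward can be as simple as checking whether the generated verdict agrees with the annotated preference. The snippet below is a sketch of that idea only; the "[[A]]"/"[[B]]" verdict format is an assumption carried over from the example above, not a confirmed detail of RM-R1's training.

```python
import re


def verifiable_reward(judgment: str, preferred: str) -> float:
    """Rule-based reward for RL with verifiable rewards on preference pairs.

    Sketch only: assumes the judgment ends with a verdict tag like "[[A]]" or
    "[[B]]" and that `preferred` is the annotated label ("A" or "B").
    """
    match = re.search(r"\[\[([AB])\]\]", judgment)
    if match is None:
        return 0.0  # no parsable verdict: no reward
    return 1.0 if match.group(1) == preferred else 0.0  # reward agreement with the label


# Example: a judgment whose verdict matches the labeled preference "A"
print(verifiable_reward("...reasoning... Final verdict: [[A]]", "A"))  # 1.0
```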

Model Capabilities

Generating scoring criteria
Preference judgment
Reasoning trajectory generation
Open-domain QA evaluation
Dialogue quality scoring

Use Cases

Reinforcement Learning
RLHF/RLAIF
Serves as a plug-and-play reward function for policy optimization (see the sketch after these use cases)
Automated Evaluation
LLM Judge
Performs automatic scoring for open-domain QA, chat, and reasoning tasks
Research Tool
Process Supervision Research
Used for studying chain-of-thought verification or scoring criteria generation mechanisms
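
As a rough illustration of the plug-and-play use case, the sketch below wraps the model into a scalar reward function by judging a policy sample against a baseline answer. It reuses the hypothetical judging prompt and "[[A]]"/"[[B]]" verdict format from the earlier examples; the `model` and `tokenizer` arguments are the objects loaded there.

```python
import re


def pairwise_reward(model, tokenizer, question: str, policy_answer: str, baseline_answer: str) -> float:
    """Return +1.0 if the reward model prefers the policy answer over the baseline,
    -1.0 if it prefers the baseline, and 0.0 if no verdict can be parsed.

    Sketch only: prompt wording and verdict format are assumptions.
    """
    prompt = (
        "You are a reward model. First write your evaluation criteria and reasoning, "
        "then state your final verdict as [[A]] or [[B]].\n\n"
        f"Question: {question}\n\nAnswer A: {policy_answer}\n\nAnswer B: {baseline_answer}"
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=1024)
    judgment = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    verdict = re.search(r"\[\[([AB])\]\]", judgment)
    if verdict is None:
        return 0.0
    return 1.0 if verdict.group(1) == "A" else -1.0
```

A scalar signal of this kind can then be fed to a policy-optimization loop in place of a conventional scalar reward model head.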