RM-R1-DeepSeek-Distilled-Qwen-14B

Developed by gaotang
RM-R1 is a training framework for reasoning reward models (ReasRM), which evaluate candidate answers by generating scoring criteria or reasoning traces, yielding fully explainable judgments.
Release Time: 5/6/2025

Model Overview

This model is trained in two stages: high-quality reasoning traces are first distilled into the model, which is then optimized with reinforcement learning from verifiable rewards. It is suited to RLHF/RLAIF pipelines, automated evaluation, and research use.
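A minimal sketch of querying the model as a generative judge with Hugging Face transformers follows. The repo id, prompt wording, and the [[A]]/[[B]] verdict convention are assumptions based on this card rather than the upstream README; check the official usage instructions before relying on them.

```python
# Minimal sketch: run RM-R1 as a generative judge over two candidate answers.
# The repo id and verdict format below are assumptions, not confirmed API.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "France's capital is Lyon."

# Ask the model to write its criteria and reasoning, then pick a winner.
prompt = (
    "Compare the two answers to the question below. First write your "
    "evaluation criteria and reasoning, then state the better answer "
    "as [[A]] or [[B]].\n\n"
    f"Question: {question}\n"
    f"Answer A: {answer_a}\n"
    f"Answer B: {answer_b}"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the judgment is free-form text, the generated criteria and reasoning can be logged alongside the verdict for auditing.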

Model Features

Reasoning Reward Modeling
Evaluates answers by generating scoring criteria or reasoning traces, providing a fully explainable judgment process.
Two-Stage Training
First distills high-quality reasoning traces, then optimizes with verifiable reward reinforcement learning.
High Performance
Achieves an absolute accuracy improvement of up to +13.8% on public reward model benchmarks.

Model Capabilities

Text Ranking
Generating Scoring Criteria
Generating Reasoning Traces
Preference Expression

Use Cases

Reinforcement Learning
RLHF/RLAIF
Used as a plug-and-play reward function for policy optimization; see the reward-function sketch after this section.
Automated Evaluation
LLM Judge
Used for automated evaluation of open-domain QA, chat, and reasoning; see the batch-evaluation sketch after this section.
Research
Process Supervision Research
Used for studying chain-of-thought verification or scoring criteria generation.
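For the RLHF/RLAIF use case, a hedged sketch of wrapping the judge as a scalar reward for a policy-optimization loop is shown below. Here judge() is a hypothetical helper that runs the generation call from the earlier sketch and returns the model's verdict text; the [[A]]/[[B]] convention is likewise an assumption.

```python
# Hedged sketch: map RM-R1's textual verdict to a scalar reward.
# judge() is a hypothetical wrapper around the generation call above.
def pairwise_reward(question, policy_answer, reference_answer, judge):
    """Return 1.0 if the judge prefers the policy's answer over the reference."""
    verdict = judge(question, answer_a=policy_answer, answer_b=reference_answer)
    return 1.0 if "[[A]]" in verdict else 0.0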
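For the LLM-judge use case, the same reward function can drive a simple batch evaluation. The dataset fields below are placeholders for illustration, and judge is the hypothetical helper from the previous sketch.

```python
# Hedged sketch: win-rate evaluation built on pairwise_reward() above.
eval_set = [
    {"question": "What is 2 + 2?", "candidate": "4", "baseline": "5"},
    # ... more examples ...
]

wins = sum(
    pairwise_reward(ex["question"], ex["candidate"], ex["baseline"], judge)
    for ex in eval_set
)
print(f"Win rate vs. baseline: {wins / len(eval_set):.2%}")
```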