RM-R1 DeepSeek Distilled Qwen 32B
RM-R1 is a training framework for reasoning reward models (ReasRMs), which evaluate candidate answers by first generating scoring criteria or reasoning trajectories, yielding fully interpretable evaluations.
Downloads: 506
Release Time: 5/6/2025
Model Overview
RM-R1 is a reasoning reward model trained in two stages: distillation of high-quality reasoning trajectories, followed by reinforcement learning with verifiable rewards. This training recipe substantially improves the accuracy of its preference judgments.
Model Features
Two-Stage Training
The first stage distills high-quality reasoning trajectories; the second stage further optimizes the model with reinforcement learning using verifiable rewards.
Interpretability
Provides fully interpretable evaluations by generating scoring criteria or reasoning trajectories.
High Performance
Achieves an absolute accuracy improvement of up to 13.8% on public reward model benchmarks.
Model Capabilities
Text Ranking
Generating Scoring Criteria
Reasoning Trajectory Generation
Preference Judgment
Use Cases
RLHF / RLAIF
Policy Optimization
Serves as a plug-and-play reward function for policy optimization.
Automatic Evaluation
LLM Judge
Used for automatic evaluation of open-domain QA, chat, and reasoning.
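Below is a minimal sketch of using the model as a generative LLM judge with the Hugging Face transformers library. The repository id, judging prompt, and verdict format ([[A]]/[[B]]) are illustrative assumptions; consult the official model card for the exact prompt template the model was trained with.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; verify against the official model card.
MODEL_ID = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # Hypothetical judging prompt: ask the model to write its rubric and
    # reasoning, then emit a final verdict token.
    prompt = (
        "You are an impartial judge. Compare the two candidate answers to the "
        "question below. First write your evaluation criteria and reasoning, "
        "then state your verdict as [[A]] or [[B]].\n\n"
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=2048, do_sample=False)
    # Return only the generated continuation (rubric, reasoning, verdict).
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(judge("What is 2 + 2?", "4", "5"))

The generated text contains the scoring criteria, the reasoning trajectory, and a final verdict, which can be parsed into a preference label or mapped to a scalar reward for policy optimization.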
Research
Process Supervision
Research on process supervision, chain-of-thought verification, or scoring criteria generation.