PURE PRM 7B
A process reward model trained from Qwen2.5-Math-7B, used to improve mathematical reasoning.
Downloads: 18
Released: 2/9/2025
Model Overview
This model was obtained by fine-tuning Qwen2.5-Math-7B on the PRM800K dataset; it is mainly used to assess the quality of mathematical reasoning processes and their intermediate steps.
Model Highlights
- Process evaluation: focuses on the quality of the reasoning process and intermediate steps rather than only the final answer.
- Optimized for mathematical reasoning: tuned specifically for math reasoning tasks, improving the accuracy of reasoning steps.
- Step-separated evaluation: solution steps are separated by double newlines and each step is scored independently.
Capabilities
- Mathematical reasoning evaluation
- Process reward computation
- Step quality analysis
Use Cases
- Math education (solution step evaluation): assess the correctness of each step in a student's solution and provide a per-step reward score that helps identify incorrect steps.
- AI training (reward model for reinforcement learning): serve as the reward model in reinforcement learning to guide models toward better mathematical reasoning and improve their accuracy.
🚀 PURE's Process Reward Model (PRM) based on Qwen2.5-Math-7B
Our Process Reward Model (PRM) is used to fine-tune large language models (LLMs) to enhance their mathematical reasoning ability. For more details, please see our PURE GitHub repository. The model was obtained by fine-tuning Qwen2.5-Math-7B on the training set of the open-source PRM800K dataset. We chose Qwen2.5-Math-7B rather than Qwen2.5-Math-7B-Instruct to keep the base model consistent with our baselines. We treat the original PRM800K labels 1 and 0 as positive labels and -1 as the negative label. To avoid test-data contamination, we also removed PRM800K training samples whose math problems are identical to problems in the MATH test set.
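As a rough illustration of the label remapping and test-set decontamination described above, a minimal sketch in plain Python might look like this (the field names `labels` and `problem` are hypothetical; the actual preprocessing lives in the PURE repository):

```python
# Minimal sketch of the preprocessing described above (hypothetical field names,
# not the exact training script).

def remap_step_label(prm800k_label: int) -> int:
    # Original PRM800K step labels are 1, 0, or -1.
    # Labels 1 and 0 are treated as positive (1); -1 is treated as negative (0).
    return 1 if prm800k_label in (1, 0) else 0

def decontaminate(train_samples: list[dict], math_test_problems: set[str]) -> list[dict]:
    # Drop training samples whose math problem also appears in the MATH test set.
    return [s for s in train_samples if s["problem"] not in math_test_problems]
```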
⚠️ Important Note
This repository is different from Qwen's PRM. We train our PRM from Qwen2.5-Math-7B, whereas Qwen's PRM is based on Qwen2.5-Math-7B-Instruct.
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2.5-Math-7B |
| Task type | Token classification |
| Training data | HuggingFaceH4/prm800k-trl-dedup |
| License | Apache-2.0 |
🚀 Quick Start
⚠️ Important Note
PURE's PRM is a process reward model, typically used to provide feedback on the quality of reasoning and intermediate steps rather than for generation tasks.
Prerequisites
- Step separation: we recommend separating the individual steps of a solution with double newlines ("\n\n").
- Reward computation: a "\n" token is inserted after each step. To compute rewards, we extract the probability scores at this token and subtract the negative-class probability from the positive-class probability, yielding a reward between -1 and 1. Steps with reward > 0 are treated as correct, the rest as incorrect (see the formula below).
🤗 Hugging Face Transformers
Below is a code example for querying our PRM with the `transformers` library:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # weighted sum to approx. min; highly recommended for BoN eval and LLM fine-tuning
        # weight = torch.softmax(
        #     -process_reward / 0.1,
        #     dim=-1,
        # )
        # process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res


model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
question = "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?"
steps = [
"To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
"On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
"On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos.",
"To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."
]
step_separator = "\n"
step_separator_token = tokenizer(
step_separator,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
input_ids = tokenizer(
question,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
score_ids = []
for step in steps:
step_ids = tokenizer(
step,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
input_ids = torch.cat(
[input_ids, step_ids, step_separator_token],
dim=-1,
)
score_ids.append(input_ids.size(-1) - 1)
input_ids = input_ids.to(model.device)
token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
token_masks[0, score_ids] = True
assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)
logits = model(input_ids).logits
step_reward = make_step_rewards(logits, token_masks)
print(step_reward) # [[0.796875, 0.185546875, -0.0625, 0.078125]]
# For BoN eval,
# uncomment the weighted sum part in `make_step_rewards` func,
# then sum the rewards to get the final score (outcome reward):
# torch.tensor(step_reward).sum(dim=-1)
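The commented-out weighted sum in `make_step_rewards` corresponds to the min-form credit assignment from the paper cited below: softmaxing the negated step rewards at a low temperature (0.1) concentrates nearly all of the weight on the lowest-scoring step, so the weighted sum approximates the minimum step reward. A small illustrative sketch (made-up reward values, not model outputs):

```python
import torch

# Illustrative step rewards (made-up values, not model outputs).
process_reward = torch.tensor([0.8, 0.7, -0.5, 0.6])

# Softmax over the negated rewards at temperature 0.1 puts almost all
# of the weight on the smallest reward ...
weight = torch.softmax(-process_reward / 0.1, dim=-1)

# ... so the weighted sum is close to the minimum step reward.
approx_min = (weight * process_reward).sum()
print(weight)      # ~[0.0000, 0.0000, 1.0000, 0.0000]
print(approx_min)  # ~ -0.5, close to process_reward.min()
```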
In addition, we provide code for BoN evaluation on the RLHFlow data:
```python
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForTokenClassification, AutoTokenizer

ds_names = ["GSM8K", "MATH500"]
ds = [
    load_dataset(
        f"RLHFlow/Deepseek-{ds_name}-Test"
    )['test'] for ds_name in ds_names
]


def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # weighted sum to approx. min; highly recommended for BoN eval and LLM fine-tuning
        weight = torch.softmax(
            -process_reward / 0.1,
            dim=-1,
        )
        process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res


model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

step_separator = "\n"
step_separator_token = tokenizer(
    step_separator,
    add_special_tokens=False,
    return_tensors='pt',
)['input_ids']

for ds_item, ds_name in zip(ds, ds_names):
    # sampled_ids = np.random.choice(range(len(ds_item)), size=100, replace=False)
    correct = 0
    total = 0
    for idx in tqdm(range(len(ds_item)), desc=f"Processing questions in {ds_name}"):
        question = ds_item['prompt'][idx]
        answers = ds_item['answers'][idx]
        labels = ds_item['label'][idx]
        outcome_scores = []
        question_ids = tokenizer(
            question,
            add_special_tokens=False,
            return_tensors='pt',
        )['input_ids']
        for answer in tqdm(answers, desc="Processing answers"):
            steps = [i.rstrip() for i in answer.split("\n\n")]
            input_ids = question_ids.clone()
            score_ids = []
            for step in steps:
                step_ids = tokenizer(
                    step,
                    add_special_tokens=False,
                    return_tensors='pt',
                )['input_ids']
                input_ids = torch.cat(
                    [input_ids, step_ids, step_separator_token],
                    dim=-1,
                )
                score_ids.append(input_ids.size(-1) - 1)
            input_ids = input_ids.to(model.device, dtype=torch.long)
            token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
            token_masks[0, score_ids] = True
            assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)
            with torch.no_grad():
                logits = model(input_ids).logits
            step_reward = make_step_rewards(logits, token_masks)
            outcome_reward = torch.tensor(step_reward).sum(dim=-1)
            # TODO: batch input & output
            outcome_scores.append(outcome_reward.item())
        best_idx = np.argmax(outcome_scores)
        if labels[best_idx] == 1:
            correct += 1
        total += 1
    print(f"Accuracy on {ds_name}: {correct / total}")
```
📦 Installation
- For Qwen2.5-Math models, `transformers>=4.40.0` is required; we recommend using the latest version.
📄 License
This project is released under the Apache-2.0 license.
📚 Citation
If you find our work helpful, please cite:
```bibtex
@article{cheng2025stop,
  title={Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Cheng, Jie and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Xiong, Gang and Lv, Yisheng and Wang, Fei-Yue},
  journal={arXiv preprint arXiv:2504.15275},
  year={2025}
}
```