PURE PRM 7B
A process reward model trained from Qwen2.5-Math-7B, used to improve mathematical reasoning.
Downloads: 18
Released: 2/9/2025
Model Overview
This model was obtained by fine-tuning Qwen2.5-Math-7B on the PRM800K dataset; it is mainly used to assess the quality of mathematical reasoning processes and their intermediate steps.
Model Highlights
- Process evaluation: focuses on the quality of the reasoning process and intermediate steps rather than only the final answer.
- Optimized for mathematical reasoning: tuned specifically for math reasoning tasks, improving the accuracy of reasoning steps.
- Step-separated evaluation: solution steps are separated by double newlines and each step is scored independently.
Capabilities
- Mathematical reasoning evaluation
- Process reward computation
- Step quality analysis
Use Cases
- Math education (solution step evaluation): assess the correctness of each step in a student's solution and provide a per-step reward score that helps identify incorrect steps.
- AI training (reward model for reinforcement learning): serve as the reward model in reinforcement learning to guide models toward better mathematical reasoning and improve their accuracy.
🚀 PURE's Process Reward Model (PRM) based on Qwen2.5-Math-7B
Our Process Reward Model (PRM) is used to fine-tune large language models (LLMs) to enhance their mathematical reasoning ability. For more details, please see our PURE GitHub repository. The model was obtained by fine-tuning Qwen2.5-Math-7B on the training set of the open-source PRM800K dataset. We chose Qwen2.5-Math-7B rather than Qwen2.5-Math-7B-Instruct to keep the base model consistent with our baselines. We treat the original PRM800K labels 1 and 0 as positive labels and -1 as the negative label. To avoid test-data contamination, we also removed PRM800K training samples whose math problems are identical to problems in the MATH test set.
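As a rough illustration of the label remapping and test-set decontamination described above, a minimal sketch in plain Python might look like this (the field names `labels` and `problem` are hypothetical; the actual preprocessing lives in the PURE repository):

```python
# Minimal sketch of the preprocessing described above (hypothetical field names,
# not the exact training script).

def remap_step_label(prm800k_label: int) -> int:
    # Original PRM800K step labels are 1, 0, or -1.
    # Labels 1 and 0 are treated as positive (1); -1 is treated as negative (0).
    return 1 if prm800k_label in (1, 0) else 0

def decontaminate(train_samples: list[dict], math_test_problems: set[str]) -> list[dict]:
    # Drop training samples whose math problem also appears in the MATH test set.
    return [s for s in train_samples if s["problem"] not in math_test_problems]
```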
⚠️ Important Note
This repository is different from Qwen's PRM. We train our PRM from Qwen2.5-Math-7B, whereas Qwen's PRM is based on Qwen2.5-Math-7B-Instruct.
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2.5-Math-7B |
| Task type | Token classification |
| Training data | HuggingFaceH4/prm800k-trl-dedup |
| License | Apache-2.0 |
🚀 Quick Start
⚠️ Important Note
PURE's PRM is a process reward model, typically used to provide feedback on the quality of reasoning and intermediate steps rather than for generation tasks.
Prerequisites
- Step separation: we recommend separating the individual steps of a solution with double newlines ("\n\n").
- Reward computation: a "\n" token is inserted after each step. To compute rewards, we extract the probability scores at this token and subtract the negative-class probability from the positive-class probability, yielding a reward between -1 and 1. Steps with reward > 0 are treated as correct, the rest as incorrect (see the formula below).
🤗 Hugging Face Transformers
Below is a code example for querying our PRM with the `transformers` library:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # weighted sum to approx. min; highly recommended for BoN eval and LLM fine-tuning
        # weight = torch.softmax(
        #     -process_reward / 0.1,
        #     dim=-1,
        # )
        # process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res


model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
question = "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?"
steps = [
"To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
"On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
"On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos.",
"To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."
]
step_separator = "\n"
step_separator_token = tokenizer(
step_separator,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
input_ids = tokenizer(
question,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
score_ids = []
for step in steps:
step_ids = tokenizer(
step,
add_special_tokens=False,
return_tensors='pt',
)['input_ids']
input_ids = torch.cat(
[input_ids, step_ids, step_separator_token],
dim=-1,
)
score_ids.append(input_ids.size(-1) - 1)
input_ids = input_ids.to(model.device)
token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
token_masks[0, score_ids] = True
assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)
logits = model(input_ids).logits
step_reward = make_step_rewards(logits, token_masks)
print(step_reward) # [[0.796875, 0.185546875, -0.0625, 0.078125]]
# For BoN eval,
# uncomment the weighted sum part in `make_step_rewards` func,
# then sum the rewards to get the final score (outcome reward):
# torch.tensor(step_reward).sum(dim=-1)
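The commented-out weighted sum in `make_step_rewards` corresponds to the min-form credit assignment from the paper cited below: softmaxing the negated step rewards at a low temperature (0.1) concentrates nearly all of the weight on the lowest-scoring step, so the weighted sum approximates the minimum step reward. A small illustrative sketch (made-up reward values, not model outputs):

```python
import torch

# Illustrative step rewards (made-up values, not model outputs).
process_reward = torch.tensor([0.8, 0.7, -0.5, 0.6])

# Softmax over the negated rewards at temperature 0.1 puts almost all
# of the weight on the smallest reward ...
weight = torch.softmax(-process_reward / 0.1, dim=-1)

# ... so the weighted sum is close to the minimum step reward.
approx_min = (weight * process_reward).sum()
print(weight)      # ~[0.0000, 0.0000, 1.0000, 0.0000]
print(approx_min)  # ~ -0.5, close to process_reward.min()
```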
In addition, we provide code for BoN evaluation on the RLHFlow data:
```python
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForTokenClassification, AutoTokenizer

ds_names = ["GSM8K", "MATH500"]
ds = [
    load_dataset(
        f"RLHFlow/Deepseek-{ds_name}-Test"
    )['test'] for ds_name in ds_names
]


def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # weighted sum to approx. min; highly recommended for BoN eval and LLM fine-tuning
        weight = torch.softmax(
            -process_reward / 0.1,
            dim=-1,
        )
        process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res


model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

step_separator = "\n"
step_separator_token = tokenizer(
    step_separator,
    add_special_tokens=False,
    return_tensors='pt',
)['input_ids']

for ds_item, ds_name in zip(ds, ds_names):
    # sampled_ids = np.random.choice(range(len(ds_item)), size=100, replace=False)
    correct = 0
    total = 0
    for idx in tqdm(range(len(ds_item)), desc=f"Processing questions in {ds_name}"):
        question = ds_item['prompt'][idx]
        answers = ds_item['answers'][idx]
        labels = ds_item['label'][idx]
        outcome_scores = []
        question_ids = tokenizer(
            question,
            add_special_tokens=False,
            return_tensors='pt',
        )['input_ids']
        for answer in tqdm(answers, desc="Processing answers"):
            steps = [i.rstrip() for i in answer.split("\n\n")]
            input_ids = question_ids.clone()
            score_ids = []
            for step in steps:
                step_ids = tokenizer(
                    step,
                    add_special_tokens=False,
                    return_tensors='pt',
                )['input_ids']
                input_ids = torch.cat(
                    [input_ids, step_ids, step_separator_token],
                    dim=-1,
                )
                score_ids.append(input_ids.size(-1) - 1)
            input_ids = input_ids.to(model.device, dtype=torch.long)
            token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
            token_masks[0, score_ids] = True
            assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)
            with torch.no_grad():
                logits = model(input_ids).logits
            step_reward = make_step_rewards(logits, token_masks)
            outcome_reward = torch.tensor(step_reward).sum(dim=-1)
            # TODO: batch input & output
            outcome_scores.append(outcome_reward.item())
        best_idx = np.argmax(outcome_scores)
        if labels[best_idx] == 1:
            correct += 1
        total += 1
    print(f"Accuracy on {ds_name}: {correct / total}")
```
📦 Installation
- For Qwen2.5-Math models, `transformers>=4.40.0` is required; we recommend using the latest version.
📄 License
This project is released under the Apache-2.0 license.
📚 Citation
If you find our work helpful, please cite:
```bibtex
@article{cheng2025stop,
  title={Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Cheng, Jie and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Xiong, Gang and Lv, Yisheng and Wang, Fei-Yue},
  journal={arXiv preprint arXiv:2504.15275},
  year={2025}
}
```