PURE-PRM-7B: an open-source model you can deploy for free to help improve mathematical reasoning


PURE PRM 7B

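Developed by jinachris

A process reward model trained from Qwen2.5-Math-7B, used to improve mathematical reasoning.

Large language model

Safetensors

Open-source license: Apache-2.0 #mathematical-reasoning-evaluation #process-reward-model #step-level-evaluation

Downloads: 18

Release date: 2/9/2025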

Model overview

This model was obtained by fine-tuning Qwen2.5-Math-7B on the PRM800K dataset. It is mainly used to evaluate the quality of mathematical reasoning and of its intermediate steps.

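Model features

Process evaluation

Focuses on evaluating the quality of the reasoning process and intermediate steps rather than the final result.

Optimized for mathematical reasoning

Optimized specifically for mathematical reasoning tasks; its step-level scores can be used to improve the accuracy of reasoning steps.

Step-separated evaluation

Solution steps are separated by double line breaks so that each step can be scored individually.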
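As a minimal sketch of the step-separation convention, the snippet below splits a solution on double line breaks and strips trailing whitespace, which mirrors the `answer.split("\n\n")` call in this card's BoN evaluation code. The function name `split_steps` and the sample solution are illustrative assumptions, not part of the PURE codebase.

```python
def split_steps(solution: str) -> list[str]:
    """Split a solution into steps on double line breaks.

    Hypothetical helper: empty chunks are dropped here, which the
    evaluation code in this card does not do.
    """
    return [chunk.rstrip() for chunk in solution.split("\n\n") if chunk.strip()]


# Made-up three-step solution for illustration.
solution = "Compute 1/3 of 18, which is 6.\n\nSubtract: 18 - 6 = 12.\n\nThe answer is 12."
steps = split_steps(solution)
for i, step in enumerate(steps, start=1):
    print(f"step {i}: {step}")
```

Each element of `steps` would then be tokenized and followed by the separator token at which the PRM emits its score.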

Model capabilities

Mathematical reasoning evaluation

Process reward computation

Step quality analysis

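Use cases

Mathematics education

Scoring the steps of a worked solution

Evaluates the correctness of each step in a student's solution.

Provides a reward score for each step, which helps locate incorrect steps.

AI training

Reward model for reinforcement learning

Serves as the reward model in reinforcement learning, guiding improvements in a model's mathematical reasoning.

Intended to improve the accuracy of an AI model's mathematical reasoning.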

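🚀 PURE's PRM based on Qwen2.5-Math-7B

This PRM is used in PURE to fine-tune large language models (LLMs) for mathematical reasoning; see the PURE GitHub repository for details. It was obtained by fine-tuning Qwen2.5-Math-7B on the training set of the open-source PRM800K dataset. We chose Qwen2.5-Math-7B rather than Qwen2.5-Math-7B-Instruct so that the base model matches our baselines. We treat the original 1 and 0 labels in PRM800K as positive labels and -1 as the negative label. To eliminate test data contamination, we also removed PRM800K training samples whose math queries also appear in the MATH test set.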
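The label mapping and decontamination described above might look roughly like the following sketch. The record layout (`question` and `rating` keys), the function names, and the exact-match comparison after whitespace normalization are assumptions for illustration; the actual preprocessing lives in the PURE repository and may differ.

```python
def to_binary_label(rating: int) -> int:
    """Map a PRM800K step rating to a binary label: 1 and 0 -> positive (1), -1 -> negative (0)."""
    if rating not in (1, 0, -1):
        raise ValueError(f"unexpected rating: {rating}")
    return 1 if rating >= 0 else 0


def normalize(question: str) -> str:
    """Collapse whitespace so trivially different copies of a question compare equal."""
    return " ".join(question.split())


def decontaminate(train_samples: list[dict], test_questions: list[str]) -> list[dict]:
    """Drop training samples whose question also appears in the test set."""
    blocked = {normalize(q) for q in test_questions}
    return [s for s in train_samples if normalize(s["question"]) not in blocked]


# Made-up records; the real datasets use a richer schema.
train = [
    {"question": "What is 2 + 2?", "rating": 1},
    {"question": "Solve x^2 = 9.", "rating": -1},
    {"question": "What is  7 * 6?", "rating": 0},
]
math_test = ["What is 7 * 6?"]

kept = decontaminate(train, math_test)
labels = [to_binary_label(s["rating"]) for s in kept]
print(kept)
print(labels)
```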

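🚀 Quick start

⚠️ Important

PURE's PRM is a process reward model. It is typically used to give feedback on the quality of reasoning and intermediate steps, not to generate text.

Prerequisites

Step separation: We recommend using double line breaks ("\n\n") to separate the individual steps of a solution.
Reward computation: After each step, we insert the token "\n". To compute the reward, we take the class probabilities predicted at this token and subtract the negative probability from the positive probability, which gives a reward between -1 and 1. Steps with reward > 0 are regarded as correct; all others are regarded as incorrect.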
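To make the reward computation concrete, here is a small sketch in plain Python, independent of the model. It assumes two logits per separator token, ordered `(negative, positive)` as in `make_step_rewards` in this card; the logit values and the helper name `step_reward` are made up for illustration.

```python
import math


def step_reward(logits: tuple[float, float]) -> float:
    """Return P(positive) - P(negative) for one separator token.

    `logits` is assumed to be ordered (negative, positive).
    """
    neg, pos = logits
    peak = max(neg, pos)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(neg - peak), math.exp(pos - peak)
    total = e_neg + e_pos
    return e_pos / total - e_neg / total


# Made-up logits for three steps.
example_logits = [(-1.0, 1.0), (0.0, 0.0), (2.0, -1.0)]
rewards = [step_reward(pair) for pair in example_logits]
verdicts = ["correct" if r > 0 else "incorrect" for r in rewards]
for r, v in zip(rewards, verdicts):
    print(f"{r:+.3f} -> {v}")
```

With only two classes, this difference equals tanh((pos - neg) / 2), which is why the reward always lies between -1 and 1. A reward of exactly 0 counts as incorrect under the "reward > 0" rule.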

🤗 Hugging Face Transformers

The following code snippet shows how to use the PRM with transformers.

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # Softmax-weighted sum that approximates the min over steps;
        # highly recommended for BoN evaluation and LLM fine-tuning
        # weight = torch.softmax(
        #     -process_reward / 0.1, 
        #     dim=-1,
        # )
        # process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res

model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, 
    device_map=device, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

question = "Sue lives in a fun neighborhood.  One weekend, the neighbors decided to play a prank on Sue.  On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard.  On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard.  Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?"
steps = [
    "To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
    "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
    "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos.",
    "To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."
]

step_separator = "\n"
step_separator_token = tokenizer(
    step_separator, 
    add_special_tokens=False, 
    return_tensors='pt',
)['input_ids']
input_ids = tokenizer(
    question, 
    add_special_tokens=False, 
    return_tensors='pt',
)['input_ids']

score_ids = []
for step in steps:
    step_ids = tokenizer(
        step, 
        add_special_tokens=False, 
        return_tensors='pt',
    )['input_ids']
    input_ids = torch.cat(
        [input_ids, step_ids, step_separator_token], 
        dim=-1,
    )
    score_ids.append(input_ids.size(-1) - 1)

input_ids = input_ids.to(model.device)
token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
token_masks[0, score_ids] = True
assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)

with torch.no_grad():  # inference only; skip gradient tracking
    logits = model(input_ids).logits
step_reward = make_step_rewards(logits, token_masks)
print(step_reward)  # [[0.796875, 0.185546875, -0.0625, 0.078125]]

# For BoN eval, 
# uncomment the weighted sum part in `make_step_rewards` func, 
# then sum the rewards to get the final score (outcome reward): 
# torch.tensor(step_reward).sum(dim=-1)
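The commented-out weighting can be understood with a small numeric sketch. With temperature 0.1, `softmax(-r / 0.1)` puts most of the weight on the lowest step reward, so the weighted sum is pulled toward the minimum rather than the mean. The reward values reuse the example output printed above; the helper name `soft_min_score` is an assumption for illustration.

```python
import math


def soft_min_score(rewards: list[float], temperature: float = 0.1) -> float:
    """Softmax-weighted sum of step rewards, which approximates min(rewards)."""
    scaled = [-r / temperature for r in rewards]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return sum((e / total) * r for e, r in zip(exps, rewards))


rewards = [0.796875, 0.185546875, -0.0625, 0.078125]
score = soft_min_score(rewards)
mean = sum(rewards) / len(rewards)
print(f"min = {min(rewards):.4f}, soft-min = {score:.4f}, mean = {mean:.4f}")
```

Lowering the temperature moves the score closer to the true minimum, so a single bad step dominates the outcome score more strongly.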

We also share code for BoN (best-of-N) evaluation on RLHFlow's data.

import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForTokenClassification, AutoTokenizer

ds_names = ["GSM8K", "MATH500"]
ds = [
    load_dataset(
        f"RLHFlow/Deepseek-{ds_name}-Test"
    )['test'] for ds_name in ds_names
]

def make_step_rewards(logits, token_masks):
    all_scores_res = []
    for sample, token_mask in zip(logits, token_masks):
        # sample: (seq_len, num_labels)
        probs = sample[token_mask].softmax(dim=-1)  # (num_steps, 2)
        process_reward = probs[:, 1] - probs[:, 0]  # (num_steps,)
        # Softmax-weighted sum that approximates the min over steps;
        # highly recommended for BoN evaluation and LLM fine-tuning
        weight = torch.softmax(
            -process_reward / 0.1, 
            dim=-1,
        )
        process_reward = weight * process_reward
        all_scores_res.append(process_reward.cpu().tolist())
    return all_scores_res


model_name = "jinachris/PURE-PRM-7B"
device = "auto"

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, 
    device_map=device, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

step_separator = "\n"
step_separator_token = tokenizer(
    step_separator, 
    add_special_tokens=False, 
    return_tensors='pt',
)['input_ids']


for ds_item, ds_name in zip(ds, ds_names):
    # sampled_ids = np.random.choice(range(len(ds_item)), size=100, replace=False)
    correct = 0
    total = 0
    for idx in tqdm(range(len(ds_item)), desc=f"Processing questions in {ds_name}"):
        question = ds_item['prompt'][idx]
        answers = ds_item['answers'][idx]
        labels = ds_item['label'][idx]
        outcome_scores = []

        question_ids = tokenizer(
            question, 
            add_special_tokens=False, 
            return_tensors='pt',
        )['input_ids']
        for answer in tqdm(answers, desc="Processing answers"):
            steps = [i.rstrip() for i in answer.split("\n\n")]
            input_ids = question_ids.clone()

            score_ids = []
            for step in steps:
                step_ids = tokenizer(
                    step, 
                    add_special_tokens=False, 
                    return_tensors='pt',
                )['input_ids']
                input_ids = torch.cat(
                    [input_ids, step_ids, step_separator_token], 
                    dim=-1,
                )
                score_ids.append(input_ids.size(-1) - 1)

            input_ids = input_ids.to(model.device, dtype=torch.long)
            token_masks = torch.zeros_like(input_ids, dtype=torch.bool)
            token_masks[0, score_ids] = True
            assert torch.all(input_ids[token_masks].to("cpu") == step_separator_token)
            
            with torch.no_grad():
                logits = model(input_ids).logits
                step_reward = make_step_rewards(logits, token_masks)
                outcome_reward = torch.tensor(step_reward).sum(dim=-1)

            # TODO: batch input & output
            outcome_scores.append(outcome_reward.item())
        
        best_idx = np.argmax(outcome_scores)
        if labels[best_idx] == 1:
            correct += 1
        total += 1
    print(f"Accuracy on {ds_name}: {correct / total}")
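Stripped of the model calls, the best-of-N bookkeeping in the loop above reduces to the following sketch. The scores and labels are made-up toy values, and `best_of_n_accuracy` is a hypothetical helper name.

```python
def best_of_n_accuracy(all_scores: list[list[float]], all_labels: list[list[int]]) -> float:
    """Fraction of questions whose highest-scored answer is labeled correct (1)."""
    correct = 0
    for scores, labels in zip(all_scores, all_labels):
        # Index of the candidate answer with the highest outcome score.
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        correct += int(labels[best_idx] == 1)
    return correct / len(all_scores)


# Toy data: two questions, three candidate answers each.
toy_scores = [[-0.02, 0.31, 0.10], [0.05, -0.40, 0.01]]
toy_labels = [[0, 1, 1], [0, 1, 1]]
accuracy = best_of_n_accuracy(toy_scores, toy_labels)
print(accuracy)
```

In this toy case the PRM picks a correct answer for the first question and an incorrect one for the second, even though correct candidates exist for both.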

📚 Details

Requirements

Qwen2.5-Math models require transformers>=4.40.0. We recommend using the latest version.
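One way to check this requirement before loading the model is sketched below, using only the standard library. The helper names are hypothetical, and the parser is a simplification: it reads only the leading numeric components, so a pre-release such as `4.40.0.dev0` is treated the same as `4.40.0`.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading numeric components of a version string, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def meets_minimum(installed: str, minimum: str = "4.40.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
    print(f"transformers {installed}: ok={meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```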

Notes

⚠️ Important

This repository is different from Qwen's PRM. We train our PRM from Qwen2.5-Math-7B, whereas Qwen's PRM is based on Qwen2.5-Math-7B-Instruct.

Citation

If you find our work useful, please consider citing it as follows.

@article{cheng2025stop,
  title={Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Cheng, Jie and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Xiong, Gang and Lv, Yisheng and Wang, Fei-Yue},
  journal={arXiv preprint arXiv:2504.15275},
  year={2025}
}