POLAR-7B open-source model: reward evaluation aligned with human preferences through effective policy discrimination

POLAR 7B

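Developed by internlm

POLAR-7B is a scalar reward model built on large-scale pre-training. It adopts a novel policy discriminative learning paradigm that distinguishes between policies and aligns with human preferences.

Large language model

Transformers

Multilingual | Open-source license: Apache-2.0 | #policy-discriminative-learning #rl-reward-model #multilingual-reward-evaluation

Downloads: 316

Release date: 7/4/2025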

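Model overview

POLAR-7B is a scalar reward model designed for reinforcement learning. After large-scale pre-training, it is fine-tuned on a small amount of preference data to align quickly with human preferences, and it is suited to text-ranking tasks.

Model features

Novel pre-training paradigm

POLAR trains the reward model to recognize identical policies and discriminate between different ones, capturing the relative difference between policies.

Designed for reinforcement fine-tuning

POLAR assigns rewards to LLM trajectories based on a given reference, so it fits seamlessly into the reinforcement fine-tuning (RFT) framework.

Strong performance and generalization

POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, generalizes effectively to unseen scenarios, and substantially reduces reward hacking.

Easy to customize

Pre-trained checkpoints are provided so that researchers can readily fine-tune reward models for their own custom scenarios.

Model capabilities

Policy discrimination

Text ranking

Reward signal generation

Reinforcement learning support

Use cases

Closed-ended question answering

Counting problems

Evaluates the correctness of answers to counting questions.

Accurately distinguishes correct counting answers from incorrect ones.

Open-ended question answering

Book summarization

Evaluates the quality of book summaries.

Identifies summaries that are high quality, concise, and meet the stated requirements.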

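🚀 POLAR-7B: a scalar reward model built on large-scale pre-training

POLAR is a breakthrough in scalar reward models achieved through large-scale pre-training. It uses the novel **Policy Discriminative Learning (POLAR)** paradigm, a scalable, high-level optimization objective, to discriminate effectively between policies using a large-scale synthetic corpus. After pre-training, POLAR reward models are fine-tuned on a small amount of preference data and align quickly with human preferences.

Attribute	Details
Base model	internlm/internlm2_5-7b
Supported languages	English, Chinese
License	Apache-2.0
Tags	reward, reinforcement learning, RFT, reward model
Task type	Text ranking
Library	transformers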

💥 Github | 📜 Paper

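🚀 Quick start

✨ Key features

Novel pre-training paradigm: POLAR trains the reward model to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between two policies, a scalable, high-level optimization objective suited to modeling general ranking relationships.
Designed for reinforcement fine-tuning: POLAR assigns rewards to LLM trajectories based on a given reference, so it fits seamlessly into the reinforcement fine-tuning (RFT) framework and offers a promising way to apply RFT in general scenarios.
Strong performance and generalization: POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, provides consistently accurate and reliable reward signals, generalizes effectively to unseen scenarios, and substantially reduces reward hacking.
Easy to customize: Pre-trained POLAR checkpoints are provided so that researchers can readily fine-tune reward models for custom scenarios and adapt or extend them to the requirements of specific applications and experiments.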
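
To make the idea of a relative, ranking-style objective concrete, here is a minimal sketch of a Bradley-Terry style pairwise loss. The text above does not spell out the exact loss POLAR uses, so this is an illustration of the general technique only; the function name and arguments are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_other: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_preferred - reward_other)).

    Illustrative assumption, not POLAR's documented objective. The loss is
    small when the preferred trajectory scores higher than the other one.
    """
    margin = reward_preferred - reward_other
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss depends only on the difference between the two rewards,
# not on their absolute values.
print(pairwise_ranking_loss(2.0, -1.0))
print(pairwise_ranking_loss(-1.0, 2.0))
```

Because only the reward difference enters the loss, shifting both rewards by the same constant leaves it unchanged, which is what "relative" means here.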

📦 Installation

You can fine-tune and use POLAR with the latest xtuner. XTuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.

We recommend creating a Python 3.10 virtual environment with conda.

conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env

Install xtuner via pip.

pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'

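💻 Usage examples

Inference

Reward inference is supported through lmdeploy, sglang, and vllm. When using these inference engines, we recommend setting up a virtual environment with conda to avoid potential dependency conflicts.

Data format

Unlike traditional reward models, POLAR requires an additional reference trajectory at inference time and scores each candidate trajectory by measuring how consistent it is with the provided reference.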

data = [
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Beijing."}]
    },
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Shanghai."}]
    }
]
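
When several candidates share one prompt and reference, the records can be generated rather than written by hand. The helper below is a small sketch that reproduces the structure shown above; `build_records` is a hypothetical name, not part of xtuner.

```python
from typing import Dict, List

Message = Dict[str, str]
Record = Dict[str, List[Message]]


def build_records(prompt: str, reference: str, outputs: List[str]) -> List[Record]:
    """Pair one prompt and one reference with each candidate output."""
    records: List[Record] = []
    for output in outputs:
        records.append({
            "prompt": [{"role": "user", "content": prompt}],
            "reference": [{"role": "assistant", "content": reference}],
            "output": [{"role": "assistant", "content": output}],
        })
    return records


# Rebuilds the two-record example above.
records = build_records(
    "What is the capital of China?",
    "Beijing.",
    ["Beijing.", "Shanghai."],
)
print(len(records))
```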

Inference with transformers

import torch
from transformers import AutoModel, AutoTokenizer
from xtuner.utils import RewardModelClient

model_name = 'internlm/POLAR-7B'

# Load the reward model and tokenizer; trust_remote_code allows the model's custom code to run.
model = AutoModel.from_pretrained(
    model_name,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The client is used here only to encode `data` (defined above) into model input text.
client = RewardModelClient(model_name)
encoded_data = client.encode(data)
batch = tokenizer(encoded_data, return_tensors='pt', padding=True).to('cuda')
# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**batch)
rewards = outputs[0].squeeze(-1).cpu().tolist()
print(rewards)
# [-0.5702977776527405, -11.030370712280273] for previous example data

Inference with lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

lmdeploy serve api_server internlm/POLAR-7B --backend pytorch --server-port 30000

from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="lmdeploy",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# Encode the data first, then get rewards through the request function
encoded_data = client.encode(data)
rewards = client.lmdeploy_request_reward(encoded_data)
print(rewards)

Inference with sglang

python3 -m sglang.launch_server --model internlm/POLAR-7B --trust-remote-code --is-embedding --dp 4 --tp 2 --mem-fraction-static 0.9 --port 30000

from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="sglang",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# Encode the data first, then get rewards through the request function
encoded_data = client.encode(data)
rewards = client.sglang_request_reward(encoded_data)
print(rewards)

Inference with vllm

vllm serve internlm/POLAR-7B --task=reward --trust-remote-code --tensor-parallel-size=2 --port 30000

from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="vllm",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# Encode the data first, then get rewards through the request function
encoded_data = client.encode(data)
rewards = client.vllm_request_reward(encoded_data)
print(rewards)

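Fine-tuning

Requirements

flash_attn
tensorboard

Data format

Unlike traditional reward models, POLAR requires, for each fine-tuning sample, an additional reference trajectory alongside the chosen and rejected trajectories. Build your fine-tuning data in a train.jsonl file using the following format.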

{
    "prompt": [{"role": "user", "content": "What is the capital of China?"}],
    "reference": [{"role": "assistant", "content": "Beijing."}],
    "chosen": [{"role": "assistant", "content": "Beijing."}],
    "rejected": [{"role": "assistant", "content": "Shanghai."}]
}
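
The record above is shown pretty-printed, but the JSON Lines format stores one JSON object per line. The sketch below writes records that way after checking the four keys used above; `write_jsonl` and `REQUIRED_KEYS` are hypothetical names, and it is an assumption (based on the .jsonl extension) that xtuner's loader expects one object per line.

```python
import json
from pathlib import Path

# Keys taken from the sample record above.
REQUIRED_KEYS = ("prompt", "reference", "chosen", "rejected")


def write_jsonl(records, path):
    """Write one JSON object per line, validating every record before writing."""
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-English text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


sample = {
    "prompt": [{"role": "user", "content": "What is the capital of China?"}],
    "reference": [{"role": "assistant", "content": "Beijing."}],
    "chosen": [{"role": "assistant", "content": "Beijing."}],
    "rejected": [{"role": "assistant", "content": "Shanghai."}],
}
```

Calling `write_jsonl([sample], "train.jsonl")` would then produce a file with a single line holding the sample record.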

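Training steps

Step 0: Prepare the config file. A sample config file is provided. If it does not meet your requirements, copy it and modify it following the xtuner guide. For details on reward model training settings, see the xtuner reward model guide.
Step 1: Start fine-tuning.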

xtuner train ${CONFIG_FILE_PATH}

For example, the following commands start fine-tuning POLAR-7B-Base.

# On a single GPU
xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

# On multiple GPUs
NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

Here, --deepspeed enables DeepSpeed to optimize training. XTuner integrates several strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. To disable this feature, simply remove the argument.

Step 2: Convert the saved PTH model (a directory if DeepSpeed is used) to a Hugging Face model.

xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}

Examples

Closed-ended questions

from xtuner.utils import RewardModelClient

prompt = "How many 'r's are there in the word 'strawberry'?"
reference = "There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3."
outputs = [
    # Identical to the reference response
    "There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3.",
    # Correct answer with correct reasoning
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is three.",
    # Wrong answer with wrong reasoning
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.",
    # Wrong answer with correct reasoning
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is two.",
    # Correct answer with wrong reasoning
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.",
    # Correct answer without reasoning
    "There are 3 'r's in the word 'strawberry'.",
    # Wrong answer without reasoning
    "There are 2 'r's in the word 'strawberry'.",
]
data = [{"prompt": prompt, "reference": reference, "output": output} for output in outputs]

client = RewardModelClient("internlm/POLAR-7B", server_type="sglang", server_address="127.0.0.1:30000")
rewards = client(data)

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")

Output: There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3.
Reward: 0.054595947265625

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is three.
Reward: -2.005859375

Output: There are 3 'r's in the word 'strawberry'.
Reward: -6.70703125

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
Reward: -7.10546875

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is two.
Reward: -7.1328125

Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
Reward: -8.46875

Output: There are 2 'r's in the word 'strawberry'.
Reward: -10.8203125
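
One way to summarize this ranking is to check how often an output with the correct final answer receives a higher reward than one with a wrong final answer. The sketch below does that with the reward values printed above; the correctness labels are added here for illustration (judged by each output's final answer), and `pairwise_accuracy` is a hypothetical helper.

```python
from itertools import product

# (reward, final_answer_is_correct), rewards copied from the ranking above.
scored = [
    (0.054595947265625, True),   # identical to the reference
    (-2.005859375, True),        # answer "three", correct reasoning
    (-6.70703125, True),         # answer "3", no reasoning
    (-7.10546875, True),         # answer "three", wrong reasoning
    (-7.1328125, False),         # answer "two"
    (-8.46875, False),           # answer "two"
    (-10.8203125, False),        # answer "2", no reasoning
]


def pairwise_accuracy(items):
    """Fraction of (correct, wrong) pairs in which the correct output scores higher."""
    correct = [reward for reward, ok in items if ok]
    wrong = [reward for reward, ok in items if not ok]
    pairs = list(product(correct, wrong))
    if not pairs:
        raise ValueError("need at least one correct and one wrong output")
    wins = sum(1 for good, bad in pairs if good > bad)
    return wins / len(pairs)


print(pairwise_accuracy(scored))  # prints 1.0
```

In this example every output ending in the correct answer outranks every output ending in a wrong one, although the margin between the fourth and fifth entries is small.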

Open-ended questions

from xtuner.utils import RewardModelClient

prompt = "Summarize the first book of Frank Herbert’s Dune in one witty short sentence."
reference = "Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics."
outputs = [
    # Identical to the reference response
    "Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics.",
    # Similar to the reference response but with factual errors
    "Royal teen discovers that life’s a beach—minus the ocean, plus magic, dark wizards and deadly politics.",
    # Very different from the reference: a concise, witty summary built on an analogy to another drama
    "Young noble’s move to desert planet turns into galactic Game of Thrones with fewer dragons, more worms.",
    # Concise summary, but not witty
    "A noble family’s fall sparks a young heir’s rise as a leader on a harsh desert planet governed by prophecy and survival.",
    # Witty summary, but too long
    "Paul Atreides loses his father, gains prophetic powers, learns to ride a sandworm, leads a holy war, and discovers that being the chosen one comes with a lot of blood, sand, and questionable decisions.",
    # Concise, witty summary that covers several Dune books rather than only the first
    "Boy gets planet, becomes god, loses soul — family drama ensues across galaxies."
]
data = [{"prompt": prompt, "reference": reference, "output": output} for output in outputs]

client = RewardModelClient("internlm/POLAR-7B", server_type="sglang", server_address="127.0.0.1:30000")
rewards = client(data)

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")

Output: Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics.
Reward: 0.466552734375

Output: Young noble’s move to desert planet turns into galactic Game of Thrones with fewer dragons, more worms.
Reward: -6.91796875

Output: Royal teen discovers that life’s a beach—minus the ocean, plus magic, dark wizards and deadly politics.
Reward: -7.70703125

Output: Paul Atreides loses his father, gains prophetic powers, learns to ride a sandworm, leads a holy war, and discovers that being the chosen one comes with a lot of blood, sand, and questionable decisions.
Reward: -8.4296875

Output: A noble family’s fall sparks a young heir’s rise as a leader on a harsh desert planet governed by prophecy and survival.
Reward: -8.6484375

Output: Boy gets planet, becomes god, loses soul — family drama ensues across galaxies.
Reward: -10.359375

📄 License

Code and model weights are licensed under Apache-2.0.

Citation

@article{dou2025pretrained,
  title={Pre-Trained Policy Discriminators are General Reward Models},
  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
  journal={arXiv preprint arXiv:2507.05197},
  year={2025}
}