
POLAR-7B

Developed by internlm
POLAR-7B is a scalar reward model built on large-scale pretraining. It adopts a novel policy-discriminative learning paradigm that enables it to distinguish between policies effectively and to align with human preferences.
Downloads: 316
Released: 7/4/2025

Model Overview

POLAR-7B is a scalar reward model designed specifically for reinforcement learning. After large-scale pretraining, it can be aligned with human preferences quickly by fine-tuning on a small amount of preference data, and it is well suited to text-ranking tasks.
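The snippet below is a minimal sketch of how such a reference-based scalar reward model can be queried. It assumes the checkpoint can be loaded through a standard Hugging Face sequence-classification head producing a single logit; the prompt template and the `score` helper are illustrative assumptions rather than the official POLAR interface, so consult the internlm release for the supported loading path.

```python
# Minimal sketch: scoring a candidate answer against a reference with a
# scalar reward model. The loading path, prompt template, and `score`
# helper are illustrative assumptions, not the official POLAR API.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "internlm/POLAR-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=1,                # single scalar reward head (assumption)
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()

def score(prompt: str, reference: str, candidate: str) -> float:
    """Return a scalar reward for `candidate`, conditioned on the prompt
    and a reference answer (POLAR assigns rewards relative to a reference)."""
    text = f"Prompt: {prompt}\nReference: {reference}\nCandidate: {candidate}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```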

Model Features

Innovative pretraining paradigm
POLAR pretrains the reward model to recognize trajectories produced by the same policy and to discriminate trajectories produced by different policies, capturing the relative differences between policies (see the loss sketch after this list).
Designed specifically for reinforcement fine-tuning
POLAR assigns rewards to large language model trajectories relative to a given reference, which fits naturally into the Reinforcement Fine-Tuning (RFT) framework.
Excellent performance and generalization ability
POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, generalizes effectively to unseen scenarios, and significantly reduces reward hacking.
Easy to customize
Pretrained checkpoints are provided, enabling researchers to conveniently fine-tune the reward model for various customized scenarios.
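To make the pretraining paradigm concrete, the sketch below writes the policy-discriminative signal as a pairwise Bradley-Terry objective: relative to a reference, a trajectory sampled from the same policy should receive a higher reward than one sampled from a different policy. This is an illustrative formulation under that assumption, not POLAR's published training code.

```python
# Sketch of a policy-discriminative objective: same-policy trajectories
# should outscore different-policy trajectories relative to a reference.
# A pairwise Bradley-Terry form is assumed here for illustration.
import torch
import torch.nn.functional as F

def policy_discriminative_loss(
    r_same: torch.Tensor,  # rewards of trajectories from the SAME policy, shape (B,)
    r_diff: torch.Tensor,  # rewards of trajectories from a DIFFERENT policy, shape (B,)
) -> torch.Tensor:
    # Push r_same above r_diff: minimize -log sigmoid(r_same - r_diff).
    return -F.logsigmoid(r_same - r_diff).mean()
```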

Model Capabilities

Policy discrimination
Text ranking
Reward signal generation
Reinforcement learning support

Use Cases

Closed-ended question answering
Counting questions: evaluate the accuracy of answers to counting questions. POLAR can accurately distinguish correct from incorrect counting answers (see the ranking example below).
Open-ended question answering
Book summarization: evaluate the quality of summaries of book content. POLAR can identify summaries that are high-quality, concise, and faithful to the given requirements.
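As an illustration of the counting use case, the snippet below ranks candidate answers by their reward against a reference answer. It reuses the hypothetical `score` helper sketched in the Model Overview above.

```python
# Rank candidate answers for a counting question by reward against a
# reference. `score` is the hypothetical helper sketched above.
prompt = "How many 'r's are in the word 'strawberry'?"
reference = "There are three 'r's in 'strawberry'."
candidates = [
    "There are two 'r's.",
    "There are three 'r's.",
    "I'm not sure.",
]
ranked = sorted(candidates, key=lambda c: score(prompt, reference, c), reverse=True)
print(ranked[0])  # the answer the reward model rates closest to the reference
```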