
POLAR-7B

Developed by internlm
POLAR-7B is a scalar reward model built on large-scale pretraining. It adopts a novel policy-discriminative learning paradigm that enables it to distinguish between policies effectively and to align with human preferences.
Downloads: 316
Released: 7/4/2025

Model Overview

POLAR-7B is a scalar reward model designed specifically for reinforcement learning. After large-scale pretraining, it can be aligned with human preferences quickly by fine-tuning on a small amount of preference data, and it is well suited to text-ranking tasks.
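The snippet below is a minimal sketch of how such a reference-based scalar reward model can be queried. It assumes the checkpoint can be loaded through a standard Hugging Face sequence-classification head producing a single logit; the prompt template and the `score` helper are illustrative assumptions rather than the official POLAR interface, so consult the internlm release for the supported loading path.

```python
# Minimal sketch: scoring a candidate answer against a reference with a
# scalar reward model. The loading path, prompt template, and `score`
# helper are illustrative assumptions, not the official POLAR API.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "internlm/POLAR-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=1,                # single scalar reward head (assumption)
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()

def score(prompt: str, reference: str, candidate: str) -> float:
    """Return a scalar reward for `candidate`, conditioned on the prompt
    and a reference answer (POLAR assigns rewards relative to a reference)."""
    text = f"Prompt: {prompt}\nReference: {reference}\nCandidate: {candidate}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```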

Model Features

Innovative pretraining paradigm
POLAR pretrains the reward model to recognize trajectories produced by the same policy and to discriminate trajectories produced by different policies, capturing the relative differences between policies (see the loss sketch after this list).
Designed specifically for reinforcement fine-tuning
POLAR assigns rewards to large language model trajectories relative to a given reference, which fits naturally into the Reinforcement Fine-Tuning (RFT) framework.
Excellent performance and generalization ability
POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, generalizes effectively to unseen scenarios, and significantly reduces reward hacking.
Easy to customize
Pretrained checkpoints are provided, enabling researchers to conveniently fine-tune the reward model for various customized scenarios.
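To make the pretraining paradigm concrete, the sketch below writes the policy-discriminative signal as a pairwise Bradley-Terry objective: relative to a reference, a trajectory sampled from the same policy should receive a higher reward than one sampled from a different policy. This is an illustrative formulation under that assumption, not POLAR's published training code.

```python
# Sketch of a policy-discriminative objective: same-policy trajectories
# should outscore different-policy trajectories relative to a reference.
# A pairwise Bradley-Terry form is assumed here for illustration.
import torch
import torch.nn.functional as F

def policy_discriminative_loss(
    r_same: torch.Tensor,  # rewards of trajectories from the SAME policy, shape (B,)
    r_diff: torch.Tensor,  # rewards of trajectories from a DIFFERENT policy, shape (B,)
) -> torch.Tensor:
    # Push r_same above r_diff: minimize -log sigmoid(r_same - r_diff).
    return -F.logsigmoid(r_same - r_diff).mean()
```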

Model Capabilities

Policy discrimination
Text ranking
Reward signal generation
Reinforcement learning support

Use Cases

Closed-ended question answering
Counting questions: evaluate the accuracy of answers to counting questions. POLAR can accurately distinguish correct from incorrect counting answers (see the ranking example below).
Open-ended question answering
Book summarization: evaluate the quality of summaries of book content. POLAR can identify summaries that are high-quality, concise, and faithful to the given requirements.
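As an illustration of the counting use case, the snippet below ranks candidate answers by their reward against a reference answer. It reuses the hypothetical `score` helper sketched in the Model Overview above.

```python
# Rank candidate answers for a counting question by reward against a
# reference. `score` is the hypothetical helper sketched above.
prompt = "How many 'r's are in the word 'strawberry'?"
reference = "There are three 'r's in 'strawberry'."
candidates = [
    "There are two 'r's.",
    "There are three 'r's.",
    "I'm not sure.",
]
ranked = sorted(candidates, key=lambda c: score(prompt, reference, c), reverse=True)
print(ranked[0])  # the answer the reward model rates closest to the reference
```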