GRM-Llama3.2-3B Reward Model
This is a state-of-the-art 3B reward model that outperforms many larger reward models and can serve as a strong judge, in some cases even surpassing GPT-4/Gemini.
Quick Start
This reward model is fine-tuned from Ray2333/GRM-llama3.2-3B-sftreg on the decontaminated Skywork preference dataset v0.2. It achieves a score of 90.9 on the reward model benchmark (RewardBench), making it a state-of-the-art 3B reward model that outperforms a series of 8B reward models and can even surpass GPT-4/Gemini when used as a judge.
Check out our GRM series on Hugging Face, our paper on arXiv, and our code on GitHub.
Features
- High Performance: Scores 90.9 on the reward model benchmark, outperforming many larger reward models.
- Generalizable: Can be used as a judge in various scenarios, surpassing GPT-4/Gemini in some cases.
Installation
Install PyTorch and the Hugging Face Transformers library before running the examples below, e.g. pip install torch transformers.
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = 'cuda:0'

# Load the tokenizer and the reward model (a sequence classifier with a scalar reward head)
tokenizer = AutoTokenizer.from_pretrained('Ray2333/GRM-Llama3.2-3B-rewardmodel-ft')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/GRM-Llama3.2-3B-rewardmodel-ft',
    torch_dtype=torch.float16,
    device_map=device,
)

# A single-turn conversation to score
message = [
    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"},
]

# Format the conversation with the model's chat template, then tokenize it
message_template = tokenizer.apply_chat_template(message, tokenize=False)
kwargs = {"padding": 'longest', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)

# The model returns a single scalar reward (logit) for the whole conversation
with torch.no_grad():
    reward_tensor = reward_model(
        tokens["input_ids"][0].view(1, -1).to(device),
        attention_mask=tokens["attention_mask"][0].view(1, -1).to(device),
    )[0]
reward = reward_tensor.cpu().detach().item()
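Because the model outputs a single scalar score, it can also act as a judge that ranks several candidate responses to the same prompt (e.g. for best-of-n selection). The sketch below reuses the tokenizer, reward_model, and device loaded above; the get_reward helper, the prompt, and the candidate answers are illustrative assumptions, not part of the original card.

def get_reward(conversation):
    # Score a full conversation (a list of {'role', 'content'} dicts) and return a scalar reward.
    text = tokenizer.apply_chat_template(conversation, tokenize=False)
    inputs = tokenizer(text, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = {'role': 'user', 'content': "Explain in one sentence why the sky is blue."}

# Two hypothetical candidate answers to the same prompt
candidates = [
    "The sky looks blue because air molecules scatter shorter (blue) wavelengths of sunlight more strongly than longer ones.",
    "The sky is blue because it reflects the color of the ocean.",
]

# Score each candidate and keep the one the reward model prefers
scores = [get_reward([prompt, {'role': 'assistant', 'content': c}]) for c in candidates]
best_response = candidates[scores.index(max(scores))]
print(scores)
print("Preferred response:", best_response)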
Documentation
Evaluation
We evaluate GRM-Llama3.2-3B-rewardmodel-ft on the reward model benchmark (RewardBench), where its score of 90.9 places it among the strongest models smaller than 7B.
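For intuition, benchmarks of this kind report pairwise accuracy: the fraction of preference pairs on which the chosen response receives a higher reward than the rejected one. The sketch below illustrates that metric using the hypothetical get_reward helper from the usage example above and two hand-written pairs; it is not the official RewardBench evaluation harness.

# Illustrative (prompt, chosen, rejected) triples -- toy data, not benchmark data
pairs = [
    ("What is 2 + 2?", "2 + 2 equals 4.", "2 + 2 equals 5."),
    ("Name the capital of France.", "The capital of France is Paris.", "The capital of France is Rome."),
]

correct = 0
for prompt, chosen, rejected in pairs:
    reward_chosen = get_reward([{'role': 'user', 'content': prompt},
                                {'role': 'assistant', 'content': chosen}])
    reward_rejected = get_reward([{'role': 'user', 'content': prompt},
                                  {'role': 'assistant', 'content': rejected}])
    correct += reward_chosen > reward_rejected  # a pair is correct if the chosen response outscores the rejected one

print(f"Pairwise accuracy on these toy pairs: {correct / len(pairs):.2f}")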
Important Note
When evaluating with RewardBench, please add the '--not_quantized' flag to avoid a performance drop.
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this model helpful for your research, please cite GRM:
@inproceedings{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}