# Reward Model for LLMs
This is a reward model designed for Large Language Models (LLMs). It offers a compact yet effective solution for evaluating responses, especially useful when a small-sized reward model is required.
## Quick Start
This is a reward model (based on Gemma-2b-it) trained with a Bradley-Terry (BT) loss on the weqweasdas/preference_dataset_mixture2_and_safe_pku dataset.
This reward model is especially useful if you need a good small reward model for LLMs. You can also refer to [Ray2333/GRM-Gemma-2B-sftreg](https://huggingface.co/Ray2333/GRM-Gemma-2B-sftreg) for a better 2B reward model trained with hidden-state regularization.
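For intuition, here is a minimal sketch of the Bradley-Terry pairwise loss used for this kind of preference training. The `bradley_terry_loss` helper and the dummy reward tensors are illustrative assumptions, not the actual training code for this model.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the preferred
    (chosen) response above the reward of the rejected response."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scalar rewards for a batch of three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.5, 1.0])
loss = bradley_terry_loss(chosen, rejected)
```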
## Features
- Trained on the weqweasdas/preference_dataset_mixture2_and_safe_pku dataset with a Bradley-Terry (BT) loss, providing a reliable scalar score for evaluating LLM responses.
- A good choice for scenarios where a small-sized reward model is needed.
## Installation
No specific installation steps are required; the usage example below only depends on the `torch` and `transformers` packages.
## Usage Examples
### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a sequence classifier with a single scalar output)
tokenizer = AutoTokenizer.from_pretrained('Ray2333/Gemma-2B-rewardmodel-baseline')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/Gemma-2B-rewardmodel-baseline',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)

message = [
    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]

# Format the conversation with the model's chat template, then tokenize it
message_template = tokenizer.apply_chat_template(message, tokenize=False)
kwargs = {"padding": 'longest', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)

# The scalar reward is the model's single classification logit
with torch.no_grad():
    reward_tensor = reward_model(
        tokens["input_ids"].to(reward_model.device),
        attention_mask=tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1)
    reward = reward_tensor.cpu().detach().item()
```
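A typical follow-up use is ranking candidate responses by their scalar rewards. The sketch below reuses the `tokenizer` and `reward_model` loaded above; the `score_conversation` helper, the prompt, and the candidate replies are illustrative assumptions.

```python
def score_conversation(conversation):
    """Return the scalar reward for a user/assistant conversation
    (assumes `tokenizer` and `reward_model` from the example above)."""
    text = tokenizer.apply_chat_template(conversation, tokenize=False)
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(
            inputs["input_ids"].to(reward_model.device),
            attention_mask=inputs["attention_mask"].to(reward_model.device),
        ).logits
    return logits.reshape(-1).cpu().item()

# Compare two candidate answers to the same prompt and keep the higher-scoring one
prompt = {'role': 'user', 'content': "How do I reset a forgotten laptop password?"}
candidates = [
    {'role': 'assistant', 'content': "Use your operating system's official account-recovery flow, e.g. the reset option on the login screen."},
    {'role': 'assistant', 'content': "Just keep guessing passwords until something works."},
]
scores = [score_conversation([prompt, c]) for c in candidates]
best = candidates[scores.index(max(scores))]
```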
## Documentation
### Evaluation
We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench).
| Model | Average | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|
| [Ray2333/GRM-Gemma-2B-sftreg](https://huggingface.co/Ray2333/GRM-Gemma-2B-sftreg) (Ours, 2B) | 75.3 | 95.5 | 48.7 | 80.0 | 76.8 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 |
| Ray2333/Gemma-2B-rewardmodel-baseline (Ours, 2B) | 73.7 | 94.1 | 46.1 | 79.6 | 75.0 |
| stabilityai/stablelm-zephyr-3b (3B) | 73.1 | 86.3 | 60.1 | 70.3 | 75.7 |
| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 |
## License
This project is licensed under the MIT License.