🚀 Reward Model (based on Gemma-2b-it)
This is a reward model based on Gemma-2b-it, trained on the weqweasdas/preference_dataset_mixture2_and_safe_pku dataset with the Bradley-Terry (BT) loss. It is especially useful when you need a capable small reward model for aligning large language models (LLMs). You may also refer to [Ray2333/GRM-Gemma-2B-sftreg](https://huggingface.co/Ray2333/GRM-Gemma-2B-sftreg), a stronger 2B reward model trained with hidden-state regularization.
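For context, below is a minimal sketch of the pairwise Bradley-Terry objective mentioned above. The function name, batch layout, and toy values are illustrative assumptions, not the exact training code used for this model.

```python
import torch
import torch.nn.functional as F

def bt_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the reward of the preferred (chosen)
    response above the reward of the rejected response."""
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar rewards for a batch of three preference pairs (illustrative values)
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, -0.1, 1.5])
print(bt_loss(chosen, rejected))  # small positive scalar; shrinks as the margin grows
```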
🚀 Quick Start
Model Evaluation
We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench).
| Model | Average | Chat | Chat Hard | Safety | Reasoning |
|-------|---------|------|-----------|--------|-----------|
| [Ray2333/GRM-Gemma-2B-sftreg](https://huggingface.co/Ray2333/GRM-Gemma-2B-sftreg) (Ours, 2B) | 75.3 | 95.5 | 48.7 | 80.0 | 76.8 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 |
| Ray2333/Gemma-2B-rewardmodel-baseline (Ours, 2B) | 73.7 | 94.1 | 46.1 | 79.6 | 75.0 |
| stabilityai/stablelm-zephyr-3b (3B) | 73.1 | 86.3 | 60.1 | 70.3 | 75.7 |
| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 |
💻 Usage Example
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a sequence classifier with a single scalar head)
tokenizer = AutoTokenizer.from_pretrained('Ray2333/Gemma-2B-rewardmodel-baseline')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/Gemma-2B-rewardmodel-baseline',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)

message = [
    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]

# Format the conversation with the model's chat template, then tokenize
message_template = tokenizer.apply_chat_template(message, tokenize=False)
kwargs = {"padding": 'longest', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)

# Score the conversation; the single logit is the scalar reward
with torch.no_grad():
    reward_tensor = reward_model(
        tokens["input_ids"].to(reward_model.device),
        attention_mask=tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1)
    reward = reward_tensor.cpu().detach().item()
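To rank candidate responses, you can score each completion for the same prompt and keep the one with the higher reward. The snippet below is a hedged illustration that reuses the `tokenizer` and `reward_model` loaded above; the helper name `score` and the example responses are our own, not part of the original card.

```python
def score(messages):
    """Return the scalar reward for a full user/assistant conversation (hypothetical helper)."""
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text, padding='longest', truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(
            tokens["input_ids"].to(reward_model.device),
            attention_mask=tokens["attention_mask"].to(reward_model.device),
        ).logits
    return logits.reshape(-1).cpu().item()

prompt = {'role': 'user', 'content': "How do I reset a forgotten laptop password?"}
response_a = {'role': 'assistant', 'content': "You can boot into your OS's recovery mode and follow the official password-reset procedure."}
response_b = {'role': 'assistant', 'content': "I don't know."}

# The response with the higher reward is the one the model prefers
print(score([prompt, response_a]), score([prompt, response_b]))
```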
📄 License
This project is released under the MIT License.