🚀 Reward Model Trained from Human Feedback
This reward model (RM) was trained to predict which generated answer humans would judge as better for a given question. It is useful in the following areas:
- Evaluating question-answering models
- Serving as the reward score in reinforcement learning from human feedback (RLHF); see the sketch after this list
- Detecting potentially toxic responses via ranking
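When used in an RLHF loop, each sampled response is scored against its prompt and the resulting scalar serves as the reward. Below is a minimal sketch of batched reward computation under that assumption; the `compute_rewards` helper and the example prompts are illustrative, and only the model name comes from this card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model.eval()

@torch.no_grad()
def compute_rewards(questions, answers):
    """Score a batch of (question, answer) pairs; higher means more preferred."""
    inputs = tokenizer(questions, answers, padding=True, truncation=True,
                       return_tensors="pt")
    # The classification head emits one logit per pair; use it directly as the reward.
    return reward_model(**inputs).logits.squeeze(-1)

# Example: reward two candidate answers sampled for the same prompt
rewards = compute_rewards(
    ["Explain nuclear fusion like I am five"] * 2,
    ["Fusion is when tiny bits of atoms squeeze together and make a lot of energy.",
     "I don't know."],
)
print(rewards)  # one reward per candidate answer
```

In a full RLHF setup these rewards would then be passed to an RL algorithm such as PPO; that part is omitted here.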
All models were trained on the following datasets, using the same split seed across datasets (handled accordingly when a validation split was not available):
🚀 Quick Start
Model usage example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model and its tokenizer
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rank_model, tokenizer = AutoModelForSequenceClassification.from_pretrained(reward_name), AutoTokenizer.from_pretrained(reward_name)

# Score a single (question, answer) pair
question, answer = "Explain nuclear fusion like I am five", "Nuclear fusion is the process by which two or more protons and neutrons combine to form a single nucleus. It is a very important process in the universe, as it is the source of energy for stars and galaxies. Nuclear fusion is also a key process in the production of energy for nuclear power plants."
inputs = tokenizer(question, answer, return_tensors='pt')
score = rank_model(**inputs).logits[0].cpu().detach()
print(score)
```
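The model outputs a single scalar logit per question-answer pair; higher values indicate answers the model predicts human annotators would prefer, so candidate answers can be ranked by comparing their scores.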
Toxic response detection example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model and its tokenizer
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rank_model, tokenizer = AutoModelForSequenceClassification.from_pretrained(reward_name), AutoTokenizer.from_pretrained(reward_name)

# A helpful and a toxic reply to the same question
question = "I just came out of from jail, any suggestion of my future?"
helpful = "It's great to hear that you have been released from jail."
bad = "Go back to jail you scum"

# Score each reply; the helpful one should receive the higher score
inputs = tokenizer(question, helpful, return_tensors='pt')
good_score = rank_model(**inputs).logits[0].cpu().detach()

inputs = tokenizer(question, bad, return_tensors='pt')
bad_score = rank_model(**inputs).logits[0].cpu().detach()
print(good_score > bad_score)  # should print tensor([True])
```
✨ Key Features
- Multi-domain applicability: usable for QA model evaluation, reinforcement learning from human feedback, and toxic response detection.
- Multi-dataset training: trained on several high-quality datasets, which helps the model generalize.
📚 Documentation
Performance
It is likely that SyntheticGPT exhibits some kind of surface pattern in its chosen-rejected answer pairs, which makes distinguishing the better answer relatively easy.
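The remark above concerns pairwise ranking accuracy, i.e. the fraction of preference pairs in which the chosen answer scores higher than the rejected one. The sketch below shows how such an accuracy could be computed; the `pairs` placeholder data and the evaluation loop are illustrative assumptions, not the evaluation script behind this card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
model.eval()

pairs = [
    # (question, chosen answer, rejected answer) -- placeholder examples
    ("What is 2 + 2?", "2 + 2 equals 4.", "I refuse to answer."),
]

correct = 0
with torch.no_grad():
    for question, chosen, rejected in pairs:
        chosen_score = model(**tokenizer(question, chosen, return_tensors="pt")).logits[0]
        rejected_score = model(**tokenizer(question, rejected, return_tensors="pt")).logits[0]
        # A pair counts as correct when the chosen answer outranks the rejected one.
        correct += int((chosen_score > rejected_score).item())

print(f"pairwise accuracy: {correct / len(pairs):.2f}")
```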
Acknowledgements
Many thanks to stability.ai for its unwavering support with A100 compute resources. Their contribution was essential to the completion of this research project.
📄 License
This project is released under the MIT License.