🚀 Reward Model Trained from Human Feedback
This reward model (RM) was trained to predict which generated answer humans would judge as better for a given question. It is useful in the following areas:
- Evaluating question-answering models
- Serving as the reward score in reinforcement learning from human feedback (RLHF); see the sketch after this list
- Detecting potentially toxic responses via ranking
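When used in an RLHF loop, each sampled response is scored against its prompt and the resulting scalar serves as the reward. Below is a minimal sketch of batched reward computation under that assumption; the `compute_rewards` helper and the example prompts are illustrative, and only the model name comes from this card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model.eval()

@torch.no_grad()
def compute_rewards(questions, answers):
    """Score a batch of (question, answer) pairs; higher means more preferred."""
    inputs = tokenizer(questions, answers, padding=True, truncation=True,
                       return_tensors="pt")
    # The classification head emits one logit per pair; use it directly as the reward.
    return reward_model(**inputs).logits.squeeze(-1)

# Example: reward two candidate answers sampled for the same prompt
rewards = compute_rewards(
    ["Explain nuclear fusion like I am five"] * 2,
    ["Fusion is when tiny bits of atoms squeeze together and make a lot of energy.",
     "I don't know."],
)
print(rewards)  # one reward per candidate answer
```

In a full RLHF setup these rewards would then be passed to an RL algorithm such as PPO; that part is omitted here.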
All models were trained on the following datasets, using the same split seed across datasets (handled accordingly when a validation split was not available):
🚀 Quick Start
Model usage example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model and its tokenizer
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rank_model, tokenizer = AutoModelForSequenceClassification.from_pretrained(reward_name), AutoTokenizer.from_pretrained(reward_name)

# Score a single (question, answer) pair
question, answer = "Explain nuclear fusion like I am five", "Nuclear fusion is the process by which two or more protons and neutrons combine to form a single nucleus. It is a very important process in the universe, as it is the source of energy for stars and galaxies. Nuclear fusion is also a key process in the production of energy for nuclear power plants."
inputs = tokenizer(question, answer, return_tensors='pt')
score = rank_model(**inputs).logits[0].cpu().detach()
print(score)
```
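The model outputs a single scalar logit per question-answer pair; higher values indicate answers the model predicts human annotators would prefer, so candidate answers can be ranked by comparing their scores.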
Toxic response detection example
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model and its tokenizer
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rank_model, tokenizer = AutoModelForSequenceClassification.from_pretrained(reward_name), AutoTokenizer.from_pretrained(reward_name)

# A helpful and a toxic reply to the same question
question = "I just came out of from jail, any suggestion of my future?"
helpful = "It's great to hear that you have been released from jail."
bad = "Go back to jail you scum"

# Score each reply; the helpful one should receive the higher score
inputs = tokenizer(question, helpful, return_tensors='pt')
good_score = rank_model(**inputs).logits[0].cpu().detach()

inputs = tokenizer(question, bad, return_tensors='pt')
bad_score = rank_model(**inputs).logits[0].cpu().detach()
print(good_score > bad_score)  # should print tensor([True])
```
✨ Key Features
- Multi-domain applicability: usable for QA model evaluation, reinforcement learning from human feedback, and toxic response detection.
- Multi-dataset training: trained on several high-quality datasets, which helps the model generalize.
📚 Documentation
Performance
It is likely that SyntheticGPT exhibits some kind of surface pattern in its chosen-rejected answer pairs, which makes distinguishing the better answer relatively easy.
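The remark above concerns pairwise ranking accuracy, i.e. the fraction of preference pairs in which the chosen answer scores higher than the rejected one. The sketch below shows how such an accuracy could be computed; the `pairs` placeholder data and the evaluation loop are illustrative assumptions, not the evaluation script behind this card.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
model.eval()

pairs = [
    # (question, chosen answer, rejected answer) -- placeholder examples
    ("What is 2 + 2?", "2 + 2 equals 4.", "I refuse to answer."),
]

correct = 0
with torch.no_grad():
    for question, chosen, rejected in pairs:
        chosen_score = model(**tokenizer(question, chosen, return_tensors="pt")).logits[0]
        rejected_score = model(**tokenizer(question, rejected, return_tensors="pt")).logits[0]
        # A pair counts as correct when the chosen answer outranks the rejected one.
        correct += int((chosen_score > rejected_score).item())

print(f"pairwise accuracy: {correct / len(pairs):.2f}")
```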
Acknowledgements
Many thanks to stability.ai for its unwavering support with A100 compute resources. Their contribution was essential to the completion of this research project.
📄 License
This project is released under the MIT License.