🚀 Pairwise Reward Model (PairRM)
PairRM is a pairwise reward model for large language models (LLMs). It takes an instruction and a pair of output candidates as input and outputs a score for each candidate measuring its relative quality. PairRM can be used to rank lists of candidate outputs, evaluate LLM quality, enhance decoding, and serve as a reward model for further aligning instruction-tuned LLMs with RLHF.
🚀 Quick Start
This is the Hugging Face-compatible version of llm-blender/PairRM. It can be loaded directly with DebertaV2PairRM:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List

pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]

def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
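If you want a soft preference score instead of a hard True/False decision, one option is to squash the comparison logit through a sigmoid. This is a minimal sketch, not part of the official API; it assumes the pairwise logit can be read as a Bradley-Terry style score, so the result is only an uncalibrated preference strength.

```python
import torch

# Hypothetical post-processing: map each pairwise logit to a soft score in (0, 1).
# A value > 0.5 means candidate A is preferred over candidate B for that input.
preference_strength = torch.sigmoid(outputs.logits)
print(preference_strength.tolist())
# roughly [~0.87, ~0.22] for the logits printed above (values are illustrative)
```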
You can also copy the DebertaV2PairRM model code into a local file instead of importing it from the llm-blender package.
The code above produces exactly the same results as using the original LLM-blender wrapper:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender

blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9 -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
We still recommend using PairRM through the llm-blender wrapper, since it already implements many useful application functions for various scenarios, such as ranking, conversation comparison, and best-of-n sampling.
You can also easily compare two conversations as follows:
````python
def tokenize_conv_pair(convAs: List[str], convBs: List[str]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates
    Multi-turn conversation comparison is also supported.
    a conversation format is:
    ```python
    [
        {
            "content": "hello",
            "role": "USER"
        },
        {
            "content": "hi",
            "role": "ASSISTANT"
        },
        ...
    ]
    ```
    Args:
        convAs (List[List[dict]]): List of conversations
        convBs (List[List[dict]]): List of conversations
    """
    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"
    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
````
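For completeness, here is a small usage sketch (not from the original card) that feeds tokenize_conv_pair through the pairrm model and tokenize_pair helper loaded in the quick start above; the two example conversations are made up for illustration.

```python
# Hypothetical example conversations: same USER turn, different ASSISTANT replies.
convA = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "The capital of France is Paris.", "role": "ASSISTANT"},
]
convB = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "I am not sure.", "role": "ASSISTANT"},
]

conv_encodings = tokenize_conv_pair([convA], [convB])
conv_encodings = {k: v.to(pairrm.device) for k, v in conv_encodings.items()}
conv_outputs = pairrm(**conv_encodings)
print(conv_outputs.logits > 0)  # True -> conversation A's responses are preferred over B's
```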
✨ Key Features
- Efficient comparison: unlike other reward models that encode and score each candidate separately, PairRM takes a pair of candidates as input and compares them side by side to identify the subtle differences between them.
- Lightweight model: based on microsoft/deberta-v3-large, the model has only 0.4B parameters, yet its performance approaches that of GPT-4.
- Multi-scenario support: it can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and further align instruction-tuned LLMs with RLHF methods.
📦 Installation
- First, install llm-blender:

```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:

```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```
💻 Usage Examples
Basic usage
Use case 1: Comparing/ranking output candidates given an instruction
- Rank a list of candidate responses
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
- Directly compare two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0]--> True
- Compare two multi-turn conversations
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those of conv2
```
Advanced usage
Use case 2: Best-of-n sampling (decoding enhancement)
Best-of-n sampling, also known as rejection sampling, is a strategy that improves response quality by selecting the response ranked highest by the reward model (see Section 3.2 of the OpenAI WebGPT paper and the OpenAI blog for more information). Best-of-n sampling with PairRM is a very simple way to improve your LLM with only a few changes to your inference code:
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a high correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. We believe PairRM will help align future LLMs in a more efficient and effective way. Through the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
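As a rough illustration (not from the original card), the sketch below uses blender.compare() to label which of two sampled responses is preferred and turns the result into prompt/chosen/rejected records, a common input format for preference-tuning trainers such as trl's DPOTrainer; the prompts and candidate responses are hypothetical placeholders.

```python
# Hypothetical candidate generations for each prompt (e.g., two samples from your policy model).
prompts = ["Explain photosynthesis in one sentence.", "Write a haiku about autumn."]
responses_A = ["Photosynthesis is how plants turn sunlight into chemical energy.", "Leaves drift slowly down."]
responses_B = ["It is a plant thing.", "Autumn is a season."]

# PairRM decides, per prompt, whether response A is preferred over response B.
a_is_better = blender.compare(prompts, responses_A, responses_B)

# Convert the comparisons into preference records (prompt / chosen / rejected).
preference_data = [
    {
        "prompt": p,
        "chosen": a if better else b,
        "rejected": b if better else a,
    }
    for p, a, b, better in zip(prompts, responses_A, responses_B, a_is_better)
]
print(preference_data[0]["chosen"])
```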
🔥 See our example Jupyter notebook for more detailed usage: blender_usage.ipynb
📚 Documentation
- GitHub: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
🔧 Technical Details
Context length

| Attribute | Details |
|---|---|
| Model type | PairRM |
| Source (input) max length | 1224 |
| Candidate max length | 412 |
| Total max length | 2048 |
Training datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a high correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison on the following datasets:
- [Auto-J pairwise testing data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- HHH-alignment
- MT-bench human judgements
All results are reported as pairwise comparison accuracy (agreement).
Performance on the Auto-J pairwise testing data
| Model | Summarization | Exam | Code | Rewriting | Creative Writing | Functional Writing | Communication | NLP | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
| PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH Alignment and MT-bench human judgements

| Evaluator LM | HHH: Helpful | HHH: Harmless | HHH: Honest | HHH: Other | HHH: Total Avg. | MT-bench: Human Preference |
|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP reward model | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST reward model | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B + COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
| PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches the performance of GPT-4!
We attribute this to two reasons:
- Our PairRM is specially designed with a model architecture for pairwise comparison through bidirectional attention (see the LLM-blender paper for more details).
- It was trained on high-quality, large-scale human preference annotation data (see the list of training datasets on this Hugging Face page).
📄 License
This project is licensed under the MIT License.
📖 Citation & Acknowledgements
If you use PairRM in your research, please cite LLM-blender:
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```



