🚀 Pairwise Reward Model (PairRM)
PairRM is a pairwise reward model for large language models (LLMs). It takes an instruction and a pair of output candidates as input and outputs a score for each candidate to measure their relative quality. PairRM can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and facilitate further alignment of instruction-tuned LLMs with RLHF (reinforcement learning from human feedback) methods.
🚀 Quick Start
This is the Hugging Face-compatible version of llm-blender/PairRM, which can be loaded directly with DebertaV2PairRM:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List

pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]

def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
You can also copy the DebertaV2PairRM code into a local file instead of importing it from the llm-blender package.
The code above produces exactly the same results as the original LLM-Blender wrapper:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender

blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint

inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9 -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
We still recommend using PairRM through the llm-blender wrapper, since it already implements many useful application functions for various scenarios, such as ranking, conversation comparison, and best-of-n sampling.
You can also easily compare two conversations as follows:
````python
def tokenize_conv_pair(convAs: List[List[dict]], convBs: List[List[dict]]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates
    Multi-turn conversation comparison is also supported.
    a conversation format is:
    ```python
    [
        {
            "content": "hello",
            "role": "USER"
        },
        {
            "content": "hi",
            "role": "ASSISTANT"
        },
        ...
    ]
    ```
    Args:
        convAs (List[List[dict]]): List of conversations
        convBs (List[List[dict]]): List of conversations
    """
    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have an even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"

    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
````
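For illustration, here is a minimal usage sketch of tokenize_conv_pair. The conversation contents are made up, and it assumes `pairrm`, `tokenizer`, and `tokenize_pair` from the Quick Start snippet above are already defined:

```python
# Hypothetical single-turn example conversations (USER turns must match between A and B).
convAs = [[{"role": "USER", "content": "hello"},
           {"role": "ASSISTANT", "content": "hi, nice to meet you!"}]]
convBs = [[{"role": "USER", "content": "hello"},
           {"role": "ASSISTANT", "content": "go away."}]]

encodings = tokenize_conv_pair(convAs, convBs)
encodings = {k: v.to(pairrm.device) for k, v in encodings.items()}
outputs = pairrm(**encodings)
# logits > 0 means the ASSISTANT responses in convAs are preferred over those in convBs
print(outputs.logits > 0)
```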
✨ Key Features
- Efficient comparison: Unlike other reward models that encode and score each candidate separately, PairRM takes a pair of candidates as input and compares them side by side to capture the subtle differences between them.
- Lightweight model: Built on microsoft/deberta-v3-large, the model has only 0.4B parameters, yet its performance approaches that of GPT-4.
- Multi-scenario support: It can rank candidate outputs, evaluate LLM quality, enhance decoding, and facilitate further RLHF-based alignment of instruction-tuned LLMs.
📦 Installation
- First install llm-blender:

```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:

```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```
💻 Usage Examples
Basic Usage
Use case 1: Comparing/ranking output candidates given an instruction
- Ranking a list of candidate responses
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
- Directly comparing two candidate responses

```python
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0] --> True
```
- Comparing two multi-turn conversations:

```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2
```
Advanced Usage
Use case 2: Best-of-n sampling (decoding enhancement)
Best-of-n sampling, also known as rejection sampling, is a strategy to improve response quality by selecting the response ranked highest by a reward model (see Section 3.2 of OpenAI WebGPT and the OpenAI blog for more information). Best-of-n sampling with PairRM is a very easy way to improve your LLMs with only minor changes to your inference code:
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. PairRM will help the alignment of future LLMs in a more efficient and effective way. Through the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
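As a rough sketch of one way to do this: the helper below is hypothetical (not part of llm-blender or trl) and only uses blender.compare() as shown above to turn two sets of responses into (prompt, chosen, rejected) preference triples, the format commonly consumed by DPO-style trainers:

```python
def build_preference_pairs(prompts, responses_a, responses_b):
    """Hypothetical helper: build (prompt, chosen, rejected) triples from PairRM comparisons."""
    a_is_better = blender.compare(prompts, responses_a, responses_b)  # list of bool
    pairs = []
    for prompt, resp_a, resp_b, a_wins in zip(prompts, responses_a, responses_b, a_is_better):
        chosen, rejected = (resp_a, resp_b) if a_wins else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

preference_data = build_preference_pairs(
    ["hello!", "I love you!"],
    ["hi!", "I hate you!"],
    ["f**k off!", "I love you, too!"],
)
# preference_data can then be loaded into the RLHF/DPO pipeline of your choice
```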
🔥 Check out our example Jupyter notebook for more usage details: blender_usage.ipynb
📚 Documentation
- GitHub: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
🔧 Technical Details
Context Length

| Attribute | Details |
|---|---|
| Model type | PairRM |
| Source (input) max length | 1224 |
| Candidate max length | 412 |
| Total max length | 2048 |
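These limits match the defaults in the tokenize_pair snippet in the Quick Start section (source_max_length=1224, candidate_max_length=412), giving a total of 1224 + 2 × 412 = 2048 tokens.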
Training Datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison performance on the following datasets:
- [Auto-J pairwise testing data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- HHH Alignment
- MT-bench Human Judgements
All results are reported as pairwise comparison accuracy (agreement).
Performance on the Auto-J pairwise testing data

| Model | Summarization | Exam | Code | Rewriting | Creative Writing | Functional Writing | Communication | NLP | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
| PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH Alignment and MT-bench Human Judgements

| Evaluator LM | HHH Alignment: Helpful | HHH Alignment: Harmless | HHH Alignment: Honest | HHH Alignment: Other | HHH Alignment: Total Avg. | MT-bench Human Judg.: Human Preference |
|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B + COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
| PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
Although PairRM is an extremely small DeBERTa-based model (0.4B), its pairwise comparison agreement approaches the performance of GPT-4!
This is attributed to two reasons:
- PairRM is specifically designed with a model architecture for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
- It was trained on high-quality, large-scale human preference annotation data (see the list of training datasets on this Hugging Face page).
📄 License
This project is released under the MIT License.
📖 Citation & Credits
If you are using PairRM in your research, please cite LLM-Blender:
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```



