🚀 Pairwise Reward Model (PairRM)
PairRM is a pairwise reward model for large language models (LLMs). It takes an instruction and a pair of output candidates as input and outputs a score for each candidate to measure their relative quality. PairRM can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and facilitate further alignment of instruction-tuned LLMs with RLHF (reinforcement learning from human feedback) methods.
🚀 Quick Start
This is the Hugging Face-compatible version of llm-blender/PairRM, which can be loaded directly with DebertaV2PairRM:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List

pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]

def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
You can also copy the DebertaV2PairRM code into a local file instead of importing it from the llm-blender package.
The code above produces exactly the same results as the original LLM-Blender wrapper:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender

blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint

inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9 -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
We still recommend using PairRM through the llm-blender wrapper, since it already implements many useful application functions for various scenarios, such as ranking, conversation comparison, and best-of-n sampling.
You can also easily compare two conversations as follows:
````python
def tokenize_conv_pair(convAs: List[List[dict]], convBs: List[List[dict]]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates
    Multi-turn conversation comparison is also supported.
    a conversation format is:
    ```python
    [
        {
            "content": "hello",
            "role": "USER"
        },
        {
            "content": "hi",
            "role": "ASSISTANT"
        },
        ...
    ]
    ```
    Args:
        convAs (List[List[dict]]): List of conversations
        convBs (List[List[dict]]): List of conversations
    """
    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have an even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"

    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
````
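For illustration, here is a minimal usage sketch of tokenize_conv_pair. The conversation contents are made up, and it assumes `pairrm`, `tokenizer`, and `tokenize_pair` from the Quick Start snippet above are already defined:

```python
# Hypothetical single-turn example conversations (USER turns must match between A and B).
convAs = [[{"role": "USER", "content": "hello"},
           {"role": "ASSISTANT", "content": "hi, nice to meet you!"}]]
convBs = [[{"role": "USER", "content": "hello"},
           {"role": "ASSISTANT", "content": "go away."}]]

encodings = tokenize_conv_pair(convAs, convBs)
encodings = {k: v.to(pairrm.device) for k, v in encodings.items()}
outputs = pairrm(**encodings)
# logits > 0 means the ASSISTANT responses in convAs are preferred over those in convBs
print(outputs.logits > 0)
```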
✨ Key Features
- Efficient comparison: Unlike other reward models that encode and score each candidate separately, PairRM takes a pair of candidates as input and compares them side by side to capture the subtle differences between them.
- Lightweight model: Built on microsoft/deberta-v3-large, the model has only 0.4B parameters, yet its performance approaches that of GPT-4.
- Multi-scenario support: It can rank candidate outputs, evaluate LLM quality, enhance decoding, and facilitate further RLHF-based alignment of instruction-tuned LLMs.
📦 Installation
- First install llm-blender:

```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:

```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```
💻 Usage Examples
Basic Usage
Use case 1: Comparing/ranking output candidates given an instruction
- Ranking a list of candidate responses
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
- Directly comparing two candidate responses

```python
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0] --> True
```
- Comparing two multi-turn conversations:

```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2
```
Advanced Usage
Use case 2: Best-of-n sampling (decoding enhancement)
Best-of-n sampling, also known as rejection sampling, is a strategy to improve response quality by selecting the response ranked highest by a reward model (see Section 3.2 of OpenAI WebGPT and the OpenAI blog for more information). Best-of-n sampling with PairRM is a very easy way to improve your LLMs with only minor changes to your inference code:
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. PairRM will help the alignment of future LLMs in a more efficient and effective way. Through the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
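As a rough sketch of one way to do this: the helper below is hypothetical (not part of llm-blender or trl) and only uses blender.compare() as shown above to turn two sets of responses into (prompt, chosen, rejected) preference triples, the format commonly consumed by DPO-style trainers:

```python
def build_preference_pairs(prompts, responses_a, responses_b):
    """Hypothetical helper: build (prompt, chosen, rejected) triples from PairRM comparisons."""
    a_is_better = blender.compare(prompts, responses_a, responses_b)  # list of bool
    pairs = []
    for prompt, resp_a, resp_b, a_wins in zip(prompts, responses_a, responses_b, a_is_better):
        chosen, rejected = (resp_a, resp_b) if a_wins else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

preference_data = build_preference_pairs(
    ["hello!", "I love you!"],
    ["hi!", "I hate you!"],
    ["f**k off!", "I love you, too!"],
)
# preference_data can then be loaded into the RLHF/DPO pipeline of your choice
```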
🔥 Check out our example Jupyter notebook for more usage details: blender_usage.ipynb
📚 Documentation
- GitHub: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
🔧 Technical Details
Context Length

| Attribute | Details |
|---|---|
| Model type | PairRM |
| Source (input) max length | 1224 |
| Candidate max length | 412 |
| Total max length | 2048 |
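These limits match the defaults in the tokenize_pair snippet in the Quick Start section (source_max_length=1224, candidate_max_length=412), giving a total of 1224 + 2 × 412 = 2048 tokens.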
Training Datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison performance on the following datasets:
- [Auto-J pairwise testing data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- HHH Alignment
- MT-bench Human Judgements
All results are reported as pairwise comparison accuracy (agreement).
Performance on the Auto-J pairwise testing data

| Model | Summarization | Exam | Code | Rewriting | Creative Writing | Functional Writing | Communication | NLP | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
| PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH Alignment and MT-bench Human Judgements

| Evaluator LM | HHH Alignment: Helpful | HHH Alignment: Harmless | HHH Alignment: Honest | HHH Alignment: Other | HHH Alignment: Total Avg. | MT-bench Human Judg.: Human Preference |
|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B + COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
| PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
Although PairRM is an extremely small DeBERTa-based model (0.4B), its pairwise comparison agreement approaches the performance of GPT-4!
This is attributed to two reasons:
- PairRM is specifically designed with a model architecture for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
- It was trained on high-quality, large-scale human preference annotation data (see the list of training datasets on this Hugging Face page).
📄 License
This project is released under the MIT License.
📖 Citation & Credits
If you are using PairRM in your research, please cite LLM-Blender:
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```



