🚀 Pairwise Reward Model (PairRM)
PairRM is a pairwise reward model for large language models (LLMs). It takes an instruction and a pair of output candidates as input and outputs a score for each candidate measuring its relative quality. PairRM can be used to rank lists of candidate outputs, evaluate LLM quality, enhance decoding, and serve as a reward model for further aligning instruction-tuned LLMs with RLHF.
🚀 Quick Start
This is the Hugging Face-compatible version of llm-blender/PairRM. It can be loaded directly with DebertaV2PairRM:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List

pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]

def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
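If you want a soft preference score instead of a hard True/False decision, one option is to squash the comparison logit through a sigmoid. This is a minimal sketch, not part of the official API; it assumes the pairwise logit can be read as a Bradley-Terry style score, so the result is only an uncalibrated preference strength.

```python
import torch

# Hypothetical post-processing: map each pairwise logit to a soft score in (0, 1).
# A value > 0.5 means candidate A is preferred over candidate B for that input.
preference_strength = torch.sigmoid(outputs.logits)
print(preference_strength.tolist())
# roughly [~0.87, ~0.22] for the logits printed above (values are illustrative)
```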
You can also copy the DebertaV2PairRM model code into a local file instead of importing it from the llm-blender package.
The code above produces exactly the same results as using the original LLM-blender wrapper:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender

blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9 -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
We still recommend using PairRM through the llm-blender wrapper, since it already implements many useful application functions for various scenarios, such as ranking, conversation comparison, and best-of-n sampling.
You can also easily compare two conversations as follows:
````python
def tokenize_conv_pair(convAs: List[str], convBs: List[str]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates
    Multi-turn conversation comparison is also supported.
    a conversation format is:
    ```python
    [
        {
            "content": "hello",
            "role": "USER"
        },
        {
            "content": "hi",
            "role": "ASSISTANT"
        },
        ...
    ]
    ```
    Args:
        convAs (List[List[dict]]): List of conversations
        convBs (List[List[dict]]): List of conversations
    """
    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"
    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
````
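For completeness, here is a small usage sketch (not from the original card) that feeds tokenize_conv_pair through the pairrm model and tokenize_pair helper loaded in the quick start above; the two example conversations are made up for illustration.

```python
# Hypothetical example conversations: same USER turn, different ASSISTANT replies.
convA = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "The capital of France is Paris.", "role": "ASSISTANT"},
]
convB = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "I am not sure.", "role": "ASSISTANT"},
]

conv_encodings = tokenize_conv_pair([convA], [convB])
conv_encodings = {k: v.to(pairrm.device) for k, v in conv_encodings.items()}
conv_outputs = pairrm(**conv_encodings)
print(conv_outputs.logits > 0)  # True -> conversation A's responses are preferred over B's
```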
✨ Key Features
- Efficient comparison: unlike other reward models that encode and score each candidate separately, PairRM takes a pair of candidates as input and compares them side by side to identify the subtle differences between them.
- Lightweight model: based on microsoft/deberta-v3-large, the model has only 0.4B parameters, yet its performance approaches that of GPT-4.
- Multi-scenario support: it can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and further align instruction-tuned LLMs with RLHF methods.
📦 Installation
- First, install llm-blender:

```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:

```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```
💻 Usage Examples
Basic usage
Use case 1: Comparing/ranking output candidates given an instruction
- Rank a list of candidate responses
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
- Directly compare two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0]--> True
- Compare two multi-turn conversations
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those of conv2
```
Advanced usage
Use case 2: Best-of-n sampling (decoding enhancement)
Best-of-n sampling, also known as rejection sampling, is a strategy that improves response quality by selecting the response ranked highest by the reward model (see Section 3.2 of the OpenAI WebGPT paper and the OpenAI blog for more information). Best-of-n sampling with PairRM is a very simple way to improve your LLM with only a few changes to your inference code:
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
Use case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a high correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. We believe PairRM will help align future LLMs in a more efficient and effective way. Through the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
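As a rough illustration (not from the original card), the sketch below uses blender.compare() to label which of two sampled responses is preferred and turns the result into prompt/chosen/rejected records, a common input format for preference-tuning trainers such as trl's DPOTrainer; the prompts and candidate responses are hypothetical placeholders.

```python
# Hypothetical candidate generations for each prompt (e.g., two samples from your policy model).
prompts = ["Explain photosynthesis in one sentence.", "Write a haiku about autumn."]
responses_A = ["Photosynthesis is how plants turn sunlight into chemical energy.", "Leaves drift slowly down."]
responses_B = ["It is a plant thing.", "Autumn is a season."]

# PairRM decides, per prompt, whether response A is preferred over response B.
a_is_better = blender.compare(prompts, responses_A, responses_B)

# Convert the comparisons into preference records (prompt / chosen / rejected).
preference_data = [
    {
        "prompt": p,
        "chosen": a if better else b,
        "rejected": b if better else a,
    }
    for p, a, b, better in zip(prompts, responses_A, responses_B, a_is_better)
]
print(preference_data[0]["chosen"])
```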
🔥 See our example Jupyter notebook for more detailed usage: blender_usage.ipynb
📚 Documentation
- GitHub: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
🔧 Technical Details
Context length

| Attribute | Details |
|---|---|
| Model type | PairRM |
| Source (input) max length | 1224 |
| Candidate max length | 412 |
| Total max length | 2048 |
Training datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a high correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison on the following datasets:
- [Auto-J pairwise testing data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- HHH-alignment
- MT-bench human judgements
All results are reported as pairwise comparison accuracy (agreement).
Performance on the Auto-J pairwise testing data
| Model | Summarization | Exam | Code | Rewriting | Creative Writing | Functional Writing | Communication | NLP | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| UltraRM (13b) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
| PairRM (0.4b) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH Alignment and MT-bench human judgements

| Evaluator LM | HHH: Helpful | HHH: Harmless | HHH: Honest | HHH: Other | HHH: Total Avg. | MT-bench: Human Preference |
|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP reward model | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST reward model | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B + COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
| PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches the performance of GPT-4!
We attribute this to two reasons:
- Our PairRM is specially designed with a model architecture for pairwise comparison through bidirectional attention (see the LLM-blender paper for more details).
- It was trained on high-quality, large-scale human preference annotation data (see the list of training datasets on this Hugging Face page).
📄 License
This project is licensed under the MIT License.
📖 Citation & Acknowledgements
If you use PairRM in your research, please cite LLM-blender:
```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```



