PairRM-hf: Open-Source Pairwise Reward Model for Efficiently Evaluating LLM Output Quality

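PairRM-hf

Developed by llm-blender

PairRM is an efficient pairwise reward model for comparing and evaluating the output quality of large language models. It is based on the DeBERTa-v3 architecture and is designed specifically to pick up subtle differences between candidate responses.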

Large Language Model

Transformers

English · Open-source license: MIT · #PairwiseRewardModel #LLMEvaluator #CandidateReranking

Downloads: 631

Released: 1/5/2024

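Model Overview

PairRM is a lightweight, efficient reward model that compares the relative quality of two candidate responses. It supports several application scenarios, including candidate ranking, conversation comparison, and best-of-n sampling.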

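Model Features

Pairwise comparison

Evaluates a pair of candidate responses together, which lets it pick up subtle quality differences

Efficient and lightweight

Built on a DeBERTa-v3 model with 0.4B parameters, so it is cheap to run

Applicable to many scenarios

Supports ranking, conversation comparison, best-of-n sampling, and other application scenarios

Trained on multiple datasets

Trained on six human-preference datasets, which supports the reliability of its judgments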

Model Capabilities

Text quality evaluation

Response ranking

Conversation comparison

Reward scoring

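Use Cases

LLM evaluation

Ranking candidate responses

Rank candidate responses generated by multiple LLMs by quality

Identify the best response and improve output quality

Dialogue system optimization

Multi-turn conversation comparison

Compare the overall performance of two dialogue assistants

Helps choose the better dialogue strategy

Decoding enhancement

Best-of-n sampling

Select the highest-rated response from multiple samples

Improve the quality of the final output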

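🚀 Pairwise Reward Model (PairRM)

PairRM is a pairwise reward model for large language models (LLMs). It takes an instruction and a pair of candidate outputs as input and outputs a score for each candidate that measures their relative quality. It can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and further align instruction-tuned LLMs with reinforcement learning from human feedback (RLHF) methods.

🚀 Quick Start

This is the Hugging Face compatible version of llm-blender/PairRM. It can be loaded directly with DebertaV2PairRM.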

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List
pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
def tokenize_pair(sources:List[str], candidate1s:List[str], candidate2s:List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings

encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input

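You can also copy the DebertaV2PairRM code into a local file and use it instead of importing it from the llm-blender package.

The code above produces exactly the same results as the following code, which uses the original LLM-Blender wrapper.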

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender
blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9   -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input

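We still recommend the llm-blender wrapper when using PairRM, because it implements many useful application functions that cover ranking, conversation comparison, best-of-n sampling, and other scenarios.

You can also easily compare two conversations, as follows.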

def tokenize_conv_pair(convAs: List[List[dict]], convBs: List[List[dict]]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates
        Multi-turn conversation comparison is also supported.
        a conversation format is:
        ```python
        [
            {
                "content": "hello",
                "role": "USER"
            },
            {
                "content": "hi",
                "role": "ASSISTANT"
            },
            ...
        ]
        ```
    Args:
        convAs (List[List[dict]]): List of conversations (side A)
        convBs (List[List[dict]]): List of conversations (side B)
    """

    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"
    
    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
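
To make the flattening step in tokenize_conv_pair easier to follow, here is a minimal sketch that applies the same string formatting to a single conversation without calling the tokenizer. The helper name flatten_conversation is hypothetical and not part of llm-blender; it only mirrors the list comprehensions shown above.

```python
from typing import Dict, List, Tuple

Conversation = List[Dict[str, str]]


def flatten_conversation(conv: Conversation) -> Tuple[str, str]:
    """Split one conversation into (prompt text, candidate text).

    Hypothetical helper: USER turns become the prompt with numbered
    placeholders, ASSISTANT turns become the numbered candidate text.
    """
    prompt = "\n".join(
        "USER: " + conv[i]["content"] + f"\nAssistant: <Response {i // 2 + 1}>"
        for i in range(0, len(conv), 2)
    )
    candidate = "\n".join(
        f"<Response {i // 2 + 1}>: " + conv[i]["content"]
        for i in range(1, len(conv), 2)
    )
    return prompt, candidate


conv = [
    {"content": "hello", "role": "USER"},
    {"content": "hi", "role": "ASSISTANT"},
    {"content": "how are you?", "role": "USER"},
    {"content": "fine", "role": "ASSISTANT"},
]
prompt, candidate = flatten_conversation(conv)
print(prompt)
print(candidate)
```

Both conversations in a pair share the same USER turns, so they produce the same prompt text and differ only in the candidate text.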

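✨ Key Features

Efficient comparison: unlike other reward models, which encode and score each candidate independently, PairRM takes a pair of candidates as input and compares them side by side to identify subtle differences between them.
Lightweight model: based on microsoft/deberta-v3-large, the model has only 0.4B parameters, yet its pairwise-comparison agreement with human preferences is close to GPT-4's (see the Performance section).
Supports many scenarios: it can be used to rank candidate outputs, evaluate LLM quality, enhance decoding, and further align instruction-tuned LLMs with RLHF methods.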

📦 Installation

First, install llm-blender.

pip install git+https://github.com/yuchenlin/LLM-Blender.git

Then load PairRM.

import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM

💻 Usage Examples

Basic usage

Example 1: Compare/rank output candidates for a given instruction

Rank a list of candidate responses

inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"], 
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks 1st, "bye!" ranks 2nd, and "get out!" ranks 3rd.
       [1, 3, 2]], # it means "I love you too!" ranks 1st, and "I hate you!" ranks 3rd.
       dtype=int32) 

"""

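Once the ranks are available, picking the top candidate per input is a lookup for rank 1. The sketch below uses plain lists with the values copied from the example output above; pick_best is a hypothetical helper, not an llm-blender function.

```python
from typing import List


def pick_best(candidates: List[List[str]], ranks: List[List[int]]) -> List[str]:
    """Return the rank-1 candidate for every input (hypothetical helper)."""
    best = []
    for cands, cand_ranks in zip(candidates, ranks):
        # Rank 1 marks the preferred candidate; index() finds its position.
        best.append(cands[list(cand_ranks).index(1)])
    return best


candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = [[3, 1, 2], [1, 3, 2]]  # values copied from the example output above
print(pick_best(candidates_texts, ranks))
# ['hi! I am fine, thanks!', 'I love you too!']
```
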
Directly compare two candidate responses

inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0] --> True

Compare two multi-turn conversations.

conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2

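Advanced usage

Example 2: Best-of-n sampling (decoding enhancement)

Best-of-n sampling, also known as rejection sampling, is a strategy that improves response quality by selecting the response that the reward model ranks highest (for details, see Section 3.2 of the OpenAI WebGPT paper and the OpenAI blog). Best-of-n sampling with PairRM is straightforward: it requires only a small change to the inference code.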

# loading models 
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs 
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method 
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to(model.device)  # keep inputs on the model's device
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure` 

# PairRM for best-of-n sampling 
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)

print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example: 
""" 
Sure, here's a joke about OpenAI:

Why did OpenAI decide to hire a mime as their new AI researcher?

Because they wanted someone who could communicate complex ideas without making a sound!

(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""

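The selection step of best-of-n can be illustrated without a model. The following is a minimal sketch under stated assumptions: is_better stands in for a pairwise comparator such as PairRM, and toy_is_better is a toy rule (longer answer wins) used only so the example runs. It is not how blender.best_of_n_generate is implemented; it simply keeps the winner of successive pairwise comparisons.

```python
from typing import Callable, List


def best_of_n(prompt: str, samples: List[str],
              is_better: Callable[[str, str, str], bool]) -> str:
    """Keep the winner of successive pairwise comparisons (n - 1 comparisons)."""
    if not samples:
        raise ValueError("need at least one sample")
    best = samples[0]
    for challenger in samples[1:]:
        # Replace the current best only if the challenger is preferred.
        if is_better(prompt, challenger, best):
            best = challenger
    return best


def toy_is_better(prompt: str, a: str, b: str) -> bool:
    """Toy stand-in for a reward model: prefers the longer answer. Not PairRM."""
    return len(a) > len(b)


samples = ["Sure", "Sure, here's a joke about OpenAI: ...", "No."]
print(best_of_n("tell me a joke", samples, toy_is_better))
```

In practice the comparator would be backed by the reward model, and the samples would come from repeated calls to model.generate with sampling enabled.
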
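Example 3: RLHF

PairRM is trained on a variety of high-quality, large-scale datasets with human preference annotations. Despite its very small size (0.4B), it correlates strongly with human preferences and approaches GPT-4 in pairwise-comparison agreement. PairRM can therefore help align future LLMs more efficiently and effectively. With the blender.compare() function, PairRM can be applied to popular RLHF toolkits such as trl.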
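
One way the boolean results of blender.compare() could feed a preference-based training pipeline is to turn them into chosen/rejected pairs. This is only a sketch: to_preference_pairs is a hypothetical helper, and the field names "prompt", "chosen", and "rejected" are an assumption about what the downstream toolkit expects, so check that toolkit's documentation before relying on them.

```python
from typing import Dict, List


def to_preference_pairs(prompts: List[str],
                        candidates_a: List[str],
                        candidates_b: List[str],
                        a_is_better: List[bool]) -> List[Dict[str, str]]:
    """Build chosen/rejected records from pairwise comparison results."""
    if not (len(prompts) == len(candidates_a) == len(candidates_b) == len(a_is_better)):
        raise ValueError("all inputs must have the same length")
    pairs = []
    for prompt, a, b, a_wins in zip(prompts, candidates_a, candidates_b, a_is_better):
        # The preferred candidate becomes "chosen", the other "rejected".
        chosen, rejected = (a, b) if a_wins else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = [True, False]  # values from the comparison example earlier
pairs = to_preference_pairs(inputs, candidates_A, candidates_B, comparison_results)
for pair in pairs:
    print(pair)
```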

🔥 For more details, see the usage examples in the sample Jupyter notebook: blender_usage.ipynb

📚 Documentation

GitHub: https://github.com/yuchenlin/LLM-Blender
Paper: https://arxiv.org/abs/2306.02561
Space demo: https://huggingface.co/spaces/llm-blender/LLM-Blender

🔧 Technical Details

Context length

Attribute	Details
Model type	PairRM
Max source length	1224
Max candidate length	412
Max total length	2048
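
The limits above fit together as 1224 + 2 × 412 = 2048. The sketch below mirrors the budget arithmetic of tokenize_pair from the quick start, working on token counts only; candidate_budget is a hypothetical helper name and not part of llm-blender.

```python
SOURCE_MAX_LENGTH = 1224
CANDIDATE_MAX_LENGTH = 412
TOTAL_MAX_LENGTH = SOURCE_MAX_LENGTH + 2 * CANDIDATE_MAX_LENGTH  # 2048


def candidate_budget(source_token_count: int) -> int:
    """Token budget for each candidate given the source length (hypothetical helper)."""
    # The source is truncated to its own limit first.
    used_by_source = min(source_token_count, SOURCE_MAX_LENGTH)
    # Whatever is left is split evenly between the two candidates.
    return (TOTAL_MAX_LENGTH - used_by_source) // 2


print(TOTAL_MAX_LENGTH)        # 2048
print(candidate_budget(1224))  # 412
print(candidate_budget(200))   # 924
```

A short source therefore leaves more room for each candidate, while a source at or beyond the limit leaves exactly 412 tokens per candidate.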

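Training datasets

Performance

PairRM is trained on a variety of high-quality, large-scale datasets with human preference annotations. Despite its very small size (0.4B), it correlates strongly with human preferences and approaches GPT-4 in pairwise-comparison agreement.

We ran pairwise comparison tests on the following datasets.

[Auto-J pairwise test data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
HHH-alignment
MT-bench human judgments

All results are reported as pairwise comparison accuracy (agreement).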
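
As a rough illustration of the metric, agreement can be computed as the percentage of pairs where the model's preference matches the human label. This is a simplified sketch with made-up labels: pairwise_agreement is a hypothetical helper, and it does not model ties, which individual benchmarks may handle differently.

```python
from typing import Sequence


def pairwise_agreement(predicted_a_wins: Sequence[bool],
                       human_a_wins: Sequence[bool]) -> float:
    """Percentage of pairs where the predicted preference matches the human one."""
    if len(predicted_a_wins) != len(human_a_wins):
        raise ValueError("prediction and label counts differ")
    if not predicted_a_wins:
        raise ValueError("need at least one pair")
    matches = sum(p == h for p, h in zip(predicted_a_wins, human_a_wins))
    return 100.0 * matches / len(predicted_a_wins)


# Illustrative labels only, not benchmark data.
print(pairwise_agreement([True, False, True, True], [True, True, True, False]))  # 50.0
```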

Performance on the Auto-J pairwise test data

Model	Summarization	Exam	Code	Rewriting	Creative writing	Functional writing	Communication	NLP	Overall
Closed-source models
ChatGPT	33.3	40.3	36.6	31.6	48.2	40.4	47.6	45.8	42.7
Claude-2	30.6	36.1	41.7	34.2	48.1	42.5	40.6	48.5	42.4
GPT-4	59.7	51.4	69.2	58.3	66.7	60.4	58.3	65.2	61.9
Open-source models
SteamSHP	33.3	29.2	26.7	33.3	40.7	31.3	51.4	51.9	40.6
PandaLM	29.2	33.3	31.7	23.3	43.5	32.9	44.8	48.9	38.9
LLaMA-2-chat-13B	20.8	27.8	19.2	20	31.5	27.5	35.8	31.8	29
Vicuna-13B-v1.5	30.6	23.6	35	28.3	36.1	37.5	45.5	39.8	37.3
WizardLM-13B-v1.2	22.2	20.8	32.5	19.2	28.7	25.4	29.2	33	27.8
LLaMA-2-chat-70B	34.7	33.3	36.7	35.8	51.4	54.2	47.2	47.7	45.9
AUTO-J (13b)	45.8	38.9	59.2	47.5	54.6	57.1	58	57.6	54.8
UltraRM (13b)	56.94	43.06	55.0	53.33	67.13	64.17	56.25	59.85	59.85
PairRM (0.4b)	56.94	52.78	58.33	55.83	61.57	59.17	57.64	62.5	59.05

HHH-alignment and MT-bench human judgments

Evaluator LM	HHH alignment					MT-bench human judgments
	Helpfulness	Harmlessness	Honesty	Other	Overall average	Human preference
RANDOM	50	50	50	50	50	34.26
STANFORDNLP REWARD MODEL	69.49	60.34	52.46	51.16	58.82	44.79
ALMOST REWARD MODEL	74.58	67.24	78.69	86.05	76.02	49.9
LLAMA2-CHAT 7B	66.1	81.03	70.49	74.42	72.85	51.78
LLAMA2-CHAT 13B	74.58	87.93	55.74	79.07	73.76	52.34
LLAMA2-CHAT 70B	66.1	89.66	67.21	74.42	74.21	53.67
LLAMA2-CHAT 13B + COARSE	68.74	68.97	65.57	67.44	67.42	46.89
GPT-3.5-TURBO-0613	76.27	87.93	67.21	86.05	78.73	57.12
PROMETHEUS 7B	69.49	84.48	78.69	90.7	80.09	55.14
PROMETHEUS 13B	81.36	82.76	75.41	76.74	79.19	57.72
UltraRM (13B)	86.44	79.31	81.97	88.37	83.71	56
PairRM (0.4B)	84.75	84.48	80.33	90.7	84.62	59
GPT-4-0613	91.53	93.1	85.25	83.72	88.69	63.87

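PairRM is a very small DeBERTa-based model (0.4B), yet its pairwise-comparison agreement is close to GPT-4's!

There are two reasons for this.

PairRM has a model architecture designed specifically for pairwise comparison, built on bidirectional attention (see the LLM-Blender paper for details).
It is trained on high-quality, large-scale data with human preference annotations (for the list of training datasets, see this Hugging Face page).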

📄 License

This project is released under the MIT License.

📖 Citation and Acknowledgements

If you use PairRM in your research, please cite LLM-Blender.

@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}