Pairwise Reward Model for LLMs (PairRM) from LLM-Blender
Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as input and scores each candidate to measure their relative quality. It can be used for ranking candidate outputs, evaluating LLMs, enhancing decoding, and aligning instruction-tuned LLMs with RLHF methods.
- Github: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space Demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
Quick Start
Pairwise Reward Model (PairRM) is a powerful tool for evaluating and enhancing the performance of LLMs. It offers multiple use cases, such as ranking candidate outputs, improving decoding, and facilitating RLHF.
Features
- Relative Quality Assessment: PairRM takes a pair of output candidates and provides a score for each, measuring their relative quality.
- Efficient Evaluation: It can be used as an LLM evaluator in a local environment, efficiently assessing the quality of LLMs.
- Decoding Enhancement: Through best-of-n sampling, PairRM can enhance the decoding process.
- RLHF Support: Trained on high-quality datasets, it can be applied to popular RLHF toolkits.
- High Efficiency: Based on [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large), it has a small model size of 0.4B.
Installation
Install llm-blender:
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```
Load PairRM:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM
```
Usage Examples
Basic Usage
Comparing/Ranking output candidates given an instruction
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
Directly comparing two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0]--> True
Comparing two multi-turn conversations
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether
# the responses in conv1, taken together, are better than those in conv2
```
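If your conversations are stored with the usual lowercase chat roles, a small conversion step is enough before calling `blender.compare_conversations`. This is a minimal sketch that simply mirrors the role names shown in the example above (`to_pairrm_conv` is a hypothetical helper, not part of the library):

```python
def to_pairrm_conv(messages):
    # Map the common lowercase chat roles onto the role names used in the example above.
    role_map = {"user": "USER", "assistant": "ASSISTANT"}
    return [{"content": m["content"], "role": role_map[m["role"]]} for m in messages]

conv1 = to_pairrm_conv([
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "[assistant1's response 1]"},
])
```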
Advanced Usage
Best-of-n Sampling (Decoding Enhancement)
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences despite its extremely small model size (0.4B), approaching the performance of GPT-4. With the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
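For example, `blender.compare` can label which of two sampled responses is preferred and thereby build a prompt/chosen/rejected dataset for preference-tuning trainers such as trl's `DPOTrainer`. The snippet below is a minimal sketch of that idea; the candidate responses are placeholders, and the exact dataset and trainer arguments should be checked against the trl version you use:

```python
import llm_blender
from datasets import Dataset

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Placeholder prompts and two candidate responses per prompt
# (in practice these would be sampled from the policy model).
prompts = ["can you tell me a joke about OpenAI?"]
candidates_A = ["Sure"]
candidates_B = ["Sure, here's a joke about OpenAI: ..."]

# PairRM decides, for each prompt, whether candidate A is better than candidate B.
a_is_better = blender.compare(prompts, candidates_A, candidates_B)

# Convert the comparisons into the prompt/chosen/rejected format commonly
# expected by preference-tuning trainers (e.g., trl's DPOTrainer).
preference_data = Dataset.from_list([
    {"prompt": p, "chosen": a if better else b, "rejected": b if better else a}
    for p, a, b, better in zip(prompts, candidates_A, candidates_B, a_is_better)
])
```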
Check out more details on example usage in our Jupyter notebook: blender_usage.ipynb
Documentation
News
- Check out our results on the AlpacaEval leaderboard: Twitter, Leaderboard
Introduction
Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as input, and outputs a score for each candidate to measure their relative quality. It can be used for multiple purposes, including ranking candidate outputs, enhancing decoding, and aligning instruction-tuned LLMs with RLHF methods.
Unlike other RMs that encode and score each candidate separately, PairRM compares a pair of candidates side by side to identify the subtle differences between them.
Statistics
Context length
Property | Details |
---|---|
Model Type | Pairwise Reward Model (PairRM) |
Source max length | 1224 tokens |
Candidate max length | 412 tokens |
Total max length | 2048 tokens |
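These budgets are consistent with packing the source and both candidates into one full-length input; a quick arithmetic check, assuming that packing scheme:

```python
SOURCE_MAX_LEN = 1224
CANDIDATE_MAX_LEN = 412
TOTAL_MAX_LEN = 2048

# One source plus two candidates fills the 2048-token context window exactly.
assert SOURCE_MAX_LEN + 2 * CANDIDATE_MAX_LEN == TOTAL_MAX_LEN
```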
Training Datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
Auto-J pairwise test data performance
Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
---|---|---|---|---|---|---|---|---|---|
ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
AUTO-J (13B) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
UltraRM (13B) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
PairRM (0.4B) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH-Alignment and MT-Bench human judgements
Model | HHH Alignment - Help | HHH Alignment - Harm | HHH Alignment - Hon | HHH Alignment - Other | HHH Alignment - Total Avg | MT-Bench Human Judg. - Human Preference |
---|---|---|---|---|---|---|
RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
LLAMA2-CHAT 13B+COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
License
This project is licensed under the MIT License.

