Pairwise Reward Model for LLMs (PairRM) from LLM-Blender
Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as input and scores each candidate to measure their relative quality. It can be used for ranking candidate outputs, evaluating LLMs, enhancing decoding, and aligning instruction-tuned LLMs with RLHF methods.
- Github: https://github.com/yuchenlin/LLM-Blender
- Paper: https://arxiv.org/abs/2306.02561
- Space Demo: https://huggingface.co/spaces/llm-blender/LLM-Blender
Quick Start
Pairwise Reward Model (PairRM) is a powerful tool for evaluating and enhancing the performance of LLMs. It offers multiple use cases, such as ranking candidate outputs, improving decoding, and facilitating RLHF.
Features
- Relative Quality Assessment: PairRM takes a pair of output candidates and provides a score for each, measuring their relative quality.
- Efficient Evaluation: It can be used as an LLM evaluator in a local environment, efficiently assessing the quality of LLMs.
- Decoding Enhancement: Through best-of-n sampling, PairRM can enhance the decoding process.
- RLHF Support: Trained on high-quality datasets, it can be applied to popular RLHF toolkits.
- High Efficiency: Based on [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large), it has a small model size of 0.4B.
Installation
Install llm-blender:
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```
Load PairRM:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load PairRM
```
Usage Examples
Basic Usage
Comparing/Ranking output candidates given an instruction
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the ranks of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks the 1st, "bye" ranks the 2nd, and "get out!" ranks the 3rd.
[1, 3, 2]], # it means "I love you too"! ranks the the 1st, and "I hate you!" ranks the 3rd.
dtype=int32)
"""
Directly comparing two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0]--> True
Comparing two multi-turn conversations
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether
# the responses in conv1, taken together, are better than those in conv2
```
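If your conversations are stored with the usual lowercase chat roles, a small conversion step is enough before calling `blender.compare_conversations`. This is a minimal sketch that simply mirrors the role names shown in the example above (`to_pairrm_conv` is a hypothetical helper, not part of the library):

```python
def to_pairrm_conv(messages):
    # Map the common lowercase chat roles onto the role names used in the example above.
    role_map = {"user": "USER", "assistant": "ASSISTANT"}
    return [{"content": m["content"], "role": role_map[m["role"]]} for m in messages]

conv1 = to_pairrm_conv([
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "[assistant1's response 1]"},
])
```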
Advanced Usage
Best-of-n Sampling (Decoding Enhancement)
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`

# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows a strong correlation with human preferences despite its extremely small model size (0.4B), approaching the performance of GPT-4. With the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
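For example, `blender.compare` can label which of two sampled responses is preferred and thereby build a prompt/chosen/rejected dataset for preference-tuning trainers such as trl's `DPOTrainer`. The snippet below is a minimal sketch of that idea; the candidate responses are placeholders, and the exact dataset and trainer arguments should be checked against the trl version you use:

```python
import llm_blender
from datasets import Dataset

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Placeholder prompts and two candidate responses per prompt
# (in practice these would be sampled from the policy model).
prompts = ["can you tell me a joke about OpenAI?"]
candidates_A = ["Sure"]
candidates_B = ["Sure, here's a joke about OpenAI: ..."]

# PairRM decides, for each prompt, whether candidate A is better than candidate B.
a_is_better = blender.compare(prompts, candidates_A, candidates_B)

# Convert the comparisons into the prompt/chosen/rejected format commonly
# expected by preference-tuning trainers (e.g., trl's DPOTrainer).
preference_data = Dataset.from_list([
    {"prompt": p, "chosen": a if better else b, "rejected": b if better else a}
    for p, a, b, better in zip(prompts, candidates_A, candidates_B, a_is_better)
])
```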
Check out more details on example usage in our Jupyter notebook: blender_usage.ipynb
Documentation
News
- Check out our results on the AlpacaEval leaderboard: Twitter, Leaderboard
Introduction
Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as input, and outputs a score for each candidate to measure their relative quality. It can be used for multiple purposes, including ranking candidate outputs, enhancing decoding, and aligning instruction-tuned LLMs with RLHF methods.
Unlike other RMs that encode and score each candidate separately, PairRM compares a pair of candidates side by side to identify the subtle differences between them.
Statistics
Context length
Property | Details |
---|---|
Model Type | Pairwise Reward Model (PairRM) |
Source max length | 1224 tokens |
Candidate max length | 412 tokens |
Total max length | 2048 tokens |
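These budgets are consistent with packing the source and both candidates into one full-length input; a quick arithmetic check, assuming that packing scheme:

```python
SOURCE_MAX_LEN = 1224
CANDIDATE_MAX_LEN = 412
TOTAL_MAX_LEN = 2048

# One source plus two candidates fills the 2048-token context window exactly.
assert SOURCE_MAX_LEN + 2 * CANDIDATE_MAX_LEN == TOTAL_MAX_LEN
```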
Training Datasets
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
Performance
Auto-J pairwise test data performance
Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
---|---|---|---|---|---|---|---|---|---|
ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
AUTO-J (13B) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
UltraRM (13B) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
PairRM (0.4B) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |
HHH-Alignment and MT-Bench human judgements
Model | HHH Alignment - Help | HHH Alignment - Harm | HHH Alignment - Hon | HHH Alignment - Other | HHH Alignment - Total Avg | MT-Bench Human Judg. - Human Preference |
---|---|---|---|---|---|---|
RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
LLAMA2-CHAT 13B+COARSE | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
License
This project is licensed under the MIT License.

