RM-Gemma-2B Reward Model
This reward model is trained from the base model google/gemma-2b-it and is intended to provide reward scores for evaluating and ranking model responses.
Quick Start
Prerequisites
- Ensure you have installed the transformers library.
- A GPU is recommended for better performance.
Code Example
import torch
from transformers import AutoTokenizer, pipeline

rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/RM-Gemma-2B")
device = 0  # index of the GPU to use; set to -1 for CPU

# Load the reward model as a text-classification ("sentiment-analysis") pipeline.
rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/RM-Gemma-2B",
    device=device,
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",  # return the raw scalar reward rather than a probability
    "batch_size": 1,
}

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# Format the conversation with the Gemma chat template and strip the BOS token,
# since the pipeline's tokenizer will add it again.
test_texts = [
    rm_tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")
]
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
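Because function_to_apply is set to "none", each entry of rewards is the raw scalar score the model assigns to the corresponding conversation; higher scores indicate responses the reward model prefers.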
Features
- Trained from the base model google/gemma-2b-it, with a 7B version, RM-Gemma-7B, also available.
- The training script is open-sourced at https://github.com/WeiXiongUST/RLHF-Reward-Modeling.
- Trained on a mixture of multiple high-quality datasets, with a total of 250K comparison pairs after data selection and cleaning.
- Evaluated on multiple preference datasets, with results reported in the table below.
Installation
Install the libraries used in the quick-start code example:
pip install torch transformers
Usage Examples
Basic Usage
The code in the "Quick Start" section is a basic usage example: it loads the model, formats a conversation with the chat template, and obtains a scalar reward score. A common application, sketched below, is ranking several candidate responses to the same prompt by their rewards.
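The following is a minimal sketch, not part of the original card, showing how the pipeline could be used to pick the better of two candidate responses. The prompt and candidate texts are made up for illustration, and rm_pipe, rm_tokenizer, and pipe_kwargs are assumed to be defined as in the Quick Start code.

# Hypothetical candidate responses to the same prompt (illustrative only).
prompt = "How do I boil an egg?"
candidates = [
    "Just put it in water.",
    "Place the egg in boiling water, cook for about 7-10 minutes, then cool it in cold water.",
]

def build_text(prompt, response):
    # Reuse the same chat-template formatting as in the Quick Start example.
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    return rm_tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")

texts = [build_text(prompt, c) for c in candidates]
outputs = rm_pipe(texts, **pipe_kwargs)
scores = [out[0]["score"] for out in outputs]

# The candidate with the highest reward is the one the reward model prefers.
best = candidates[scores.index(max(scores))]
print(scores, best)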
Documentation
Model Details
Dataset preprocessing
The model is trained on a mixture of the datasets listed below. The total number of comparison pairs is 250K after applying the following data selection and cleaning strategies (a simplified filtering sketch follows the list):
- HH-RLHF: All the base, rejection sampling, and online subsets are used, but samples with chosen == rejected are deleted, resulting in 115547 samples.
- SHP: Only samples with a score ratio > 2 are used. For each prompt, only 1 comparison is taken, resulting in 55916 samples.
- Ultrafeedback: Similar to UltraFeedback-Binarized, the fine-grained score is used instead of the overall one to rank samples. For each prompt, the best one is compared with a randomly chosen one in the remaining samples. Pairs with equal scores are deleted, resulting in 62793 samples.
- HelpSteer: The mean of helpfulness and correctness is used to rank samples. The best sample is compared with a randomly chosen one in the remaining samples. Pairs with equal scores are deleted, resulting in 8206 samples.
- Capybara: Pairs with the same rating for chosen and rejected samples are deleted, resulting in 7562 samples.
- Orca: Pairs with the same rating for chosen and rejected samples are deleted, resulting in 6405 samples.
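As an illustration only, the kind of pair filtering described above might look like the sketch below; the record fields (chosen, rejected, score_chosen, score_rejected) are placeholder names and do not necessarily match the real dataset schemas.

# Hypothetical comparison pairs; field names are placeholders, not the real schemas.
raw_pairs = [
    {"chosen": "A good answer.", "rejected": "A bad answer.", "score_chosen": 9.0, "score_rejected": 3.0},
    {"chosen": "Same text.", "rejected": "Same text.", "score_chosen": 5.0, "score_rejected": 5.0},
]

def keep_pair(example):
    # Drop pairs where the chosen and rejected responses are identical (HH-RLHF rule).
    if example["chosen"] == example["rejected"]:
        return False
    # Drop pairs whose two responses received equal scores
    # (UltraFeedback / HelpSteer / Capybara / Orca rule).
    if example["score_chosen"] == example["score_rejected"]:
        return False
    return True

cleaned = [ex for ex in raw_pairs if keep_pair(ex)]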
Training
The model is trained for one epoch with a learning rate of 1e-5, a batch size of 256, and cosine learning-rate decay with a warmup ratio of 0.03. The training curve is shown below:

Results
Existing preference datasets are collected and used as a benchmark to evaluate the resulting reward model.
Note that for the MT-Bench dataset (lmsys/mt_bench_human_judgments), samples where the comparison result is a tie are deleted. The Alpaca data is from Here.
| Model/Test set | HH-RLHF-Helpful | SHP | HelpSteer helpful + correctness | HelpSteer All | MT Bench Human | MT Bench GPT4 | Alpaca Human | Alpaca GPT4 | Alpaca Human-crossed |
|---|---|---|---|---|---|---|---|---|---|
| UltraRM-13B | 0.71 | 0.73 | 0.72 | 0.72 | 0.78 | 0.9 | 0.65 | 0.83 | 0.62 |
| Pair-RM | 0.65 | 0.56 | 0.62 | 0.6 | 0.74 | 0.82 | 0.62 | 0.75 | 0.59 |
| RM-Gemma-2B | 0.68 | 0.73 | 0.68 | 0.72 | 0.77 | 0.87 | 0.63 | 0.78 | 0.59 |
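As an illustration (not from the original card), pairwise accuracy on a preference test set can be estimated by scoring the chosen and rejected responses with the pipeline and counting how often the chosen response receives the higher reward. The dataset name and column names below are assumptions about the benchmark format, and rm_pipe, rm_tokenizer, and pipe_kwargs come from the Quick Start example.

from datasets import load_dataset

# Hypothetical evaluation subset with plain-text "chosen" and "rejected" columns;
# the real benchmarks may need their own formatting into the Gemma chat template.
eval_set = load_dataset("Anthropic/hh-rlhf", split="test").select(range(200))

def score(text):
    out = rm_pipe([text], **pipe_kwargs)
    return out[0][0]["score"]

correct = 0
for example in eval_set:
    if score(example["chosen"]) > score(example["rejected"]):
        correct += 1

print("pairwise accuracy:", correct / len(eval_set))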
Technical Details
- The model is based on the google/gemma-2b-it base model.
- The data preprocessing and cleaning strategies ensure the quality of the training data.
- The training hyperparameters (learning rate, batch size, learning rate decay, etc.) are carefully selected to optimize the model performance.
Reference
To be added. The reward model may be readily used for rejection sampling fine-tuning (RAFT):
@article{dong2023raft,
title={Raft: Reward ranked finetuning for generative foundation model alignment},
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
journal={arXiv preprint arXiv:2304.06767},
year={2023}
}
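As a hedged illustration of how this reward model could plug into rejection-sampling fine-tuning, the sketch below samples several candidate responses from a policy model and keeps the highest-reward one. The policy model choice, sampling settings, and helper names are assumptions, and rm_pipe, rm_tokenizer, and pipe_kwargs come from the Quick Start example.

from transformers import pipeline

# Hypothetical policy model, used only for illustration.
generator = pipeline("text-generation", model="google/gemma-2b-it", device=0)

prompt = "Explain what a reward model is in one paragraph."
chat = [{"role": "user", "content": prompt}]
# Build the generation prompt; strip the BOS token because the generator's tokenizer adds it again.
prompt_text = rm_tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
).replace(rm_tokenizer.bos_token, "")

# Sample several candidate completions from the policy model.
candidates = [
    out["generated_text"]
    for out in generator(
        prompt_text,
        do_sample=True,
        max_new_tokens=256,
        num_return_sequences=4,
        return_full_text=False,
    )
]

def reward(response):
    full_chat = chat + [{"role": "assistant", "content": response}]
    text = rm_tokenizer.apply_chat_template(
        full_chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")
    return rm_pipe([text], **pipe_kwargs)[0][0]["score"]

# Keep the highest-reward candidate; in rejection-sampling fine-tuning,
# such samples would form the next supervised fine-tuning set.
best_response = max(candidates, key=reward)
print(best_response)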