RM-Gemma-2B Reward Model
This reward model is trained from the base model google/gemma-2b-it and is intended to provide reward scores for evaluating and ranking model responses.
Quick Start
Prerequisites
- Ensure you have installed the transformers library.
- A GPU is recommended for better performance.
Code Example
import torch
from transformers import AutoTokenizer, pipeline

rm_tokenizer = AutoTokenizer.from_pretrained("weqweasdas/RM-Gemma-2B")
device = 0  # index of the GPU to use; set to -1 for CPU

# Load the reward model as a text-classification ("sentiment-analysis") pipeline.
rm_pipe = pipeline(
    "sentiment-analysis",
    model="weqweasdas/RM-Gemma-2B",
    device=device,
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",  # return the raw scalar reward rather than a probability
    "batch_size": 1,
}

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# Format the conversation with the Gemma chat template and strip the BOS token,
# since the pipeline's tokenizer will add it again.
test_texts = [
    rm_tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")
]
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
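Because function_to_apply is set to "none", each entry of rewards is the raw scalar score the model assigns to the corresponding conversation; higher scores indicate responses the reward model prefers.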
Features
- Trained from the base model google/gemma-2b-it, with a 7B version, RM-Gemma-7B, also available.
- The training script is open-sourced at https://github.com/WeiXiongUST/RLHF-Reward-Modeling.
- Trained on a mixture of multiple high-quality datasets, with a total of 250K comparison pairs after data selection and cleaning.
- Evaluated on multiple preference datasets, with results reported in the table below.
Installation
Install the libraries used in the quick-start code example:
pip install torch transformers
Usage Examples
Basic Usage
The code in the "Quick Start" section is a basic usage example: it loads the model, formats a conversation with the chat template, and obtains a scalar reward score. A common application, sketched below, is ranking several candidate responses to the same prompt by their rewards.
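The following is a minimal sketch, not part of the original card, showing how the pipeline could be used to pick the better of two candidate responses. The prompt and candidate texts are made up for illustration, and rm_pipe, rm_tokenizer, and pipe_kwargs are assumed to be defined as in the Quick Start code.

# Hypothetical candidate responses to the same prompt (illustrative only).
prompt = "How do I boil an egg?"
candidates = [
    "Just put it in water.",
    "Place the egg in boiling water, cook for about 7-10 minutes, then cool it in cold water.",
]

def build_text(prompt, response):
    # Reuse the same chat-template formatting as in the Quick Start example.
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    return rm_tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")

texts = [build_text(prompt, c) for c in candidates]
outputs = rm_pipe(texts, **pipe_kwargs)
scores = [out[0]["score"] for out in outputs]

# The candidate with the highest reward is the one the reward model prefers.
best = candidates[scores.index(max(scores))]
print(scores, best)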
Documentation
Model Details
Dataset preprocessing
The model is trained on a mixture of the datasets listed below. The total number of comparison pairs is 250K after applying the following data selection and cleaning strategies (a simplified filtering sketch follows the list):
- HH-RLHF: All the base, rejection sampling, and online subsets are used, but samples with chosen == rejected are deleted, resulting in 115547 samples.
- SHP: Only samples with a score ratio > 2 are used. For each prompt, only 1 comparison is taken, resulting in 55916 samples.
- Ultrafeedback: Similar to UltraFeedback-Binarized, the fine-grained score is used instead of the overall one to rank samples. For each prompt, the best one is compared with a randomly chosen one in the remaining samples. Pairs with equal scores are deleted, resulting in 62793 samples.
- HelpSteer: The mean of helpfulness and correctness is used to rank samples. The best sample is compared with a randomly chosen one in the remaining samples. Pairs with equal scores are deleted, resulting in 8206 samples.
- Capybara: Pairs with the same rating for chosen and rejected samples are deleted, resulting in 7562 samples.
- Orca: Pairs with the same rating for chosen and rejected samples are deleted, resulting in 6405 samples.
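As an illustration only, the kind of pair filtering described above might look like the sketch below; the record fields (chosen, rejected, score_chosen, score_rejected) are placeholder names and do not necessarily match the real dataset schemas.

# Hypothetical comparison pairs; field names are placeholders, not the real schemas.
raw_pairs = [
    {"chosen": "A good answer.", "rejected": "A bad answer.", "score_chosen": 9.0, "score_rejected": 3.0},
    {"chosen": "Same text.", "rejected": "Same text.", "score_chosen": 5.0, "score_rejected": 5.0},
]

def keep_pair(example):
    # Drop pairs where the chosen and rejected responses are identical (HH-RLHF rule).
    if example["chosen"] == example["rejected"]:
        return False
    # Drop pairs whose two responses received equal scores
    # (UltraFeedback / HelpSteer / Capybara / Orca rule).
    if example["score_chosen"] == example["score_rejected"]:
        return False
    return True

cleaned = [ex for ex in raw_pairs if keep_pair(ex)]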
Training
The model is trained for one epoch with a learning rate of 1e-5, a batch size of 256, and cosine learning-rate decay with a warmup ratio of 0.03. The training curve is shown below:

Results
Existing preference datasets are collected and used as a benchmark to evaluate the resulting reward model.
Note that for the MT-Bench dataset (lmsys/mt_bench_human_judgments), samples where the comparison result is a tie are deleted. The Alpaca data is from Here.
| Model/Test set | HH-RLHF-Helpful | SHP | HelpSteer helpful + correctness | HelpSteer All | MT Bench Human | MT Bench GPT4 | Alpaca Human | Alpaca GPT4 | Alpaca Human-crossed |
|---|---|---|---|---|---|---|---|---|---|
| UltraRM-13B | 0.71 | 0.73 | 0.72 | 0.72 | 0.78 | 0.9 | 0.65 | 0.83 | 0.62 |
| Pair-RM | 0.65 | 0.56 | 0.62 | 0.6 | 0.74 | 0.82 | 0.62 | 0.75 | 0.59 |
| RM-Gemma-2B | 0.68 | 0.73 | 0.68 | 0.72 | 0.77 | 0.87 | 0.63 | 0.78 | 0.59 |
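As an illustration (not from the original card), pairwise accuracy on a preference test set can be estimated by scoring the chosen and rejected responses with the pipeline and counting how often the chosen response receives the higher reward. The dataset name and column names below are assumptions about the benchmark format, and rm_pipe, rm_tokenizer, and pipe_kwargs come from the Quick Start example.

from datasets import load_dataset

# Hypothetical evaluation subset with plain-text "chosen" and "rejected" columns;
# the real benchmarks may need their own formatting into the Gemma chat template.
eval_set = load_dataset("Anthropic/hh-rlhf", split="test").select(range(200))

def score(text):
    out = rm_pipe([text], **pipe_kwargs)
    return out[0][0]["score"]

correct = 0
for example in eval_set:
    if score(example["chosen"]) > score(example["rejected"]):
        correct += 1

print("pairwise accuracy:", correct / len(eval_set))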
Technical Details
- The model is based on the google/gemma-2b-it base model.
- The data preprocessing and cleaning strategies ensure the quality of the training data.
- The training hyperparameters (learning rate, batch size, learning rate decay, etc.) are carefully selected to optimize the model performance.
Reference
To be added. The reward model may be readily used for rejection sampling fine-tuning (RAFT):
@article{dong2023raft,
title={Raft: Reward ranked finetuning for generative foundation model alignment},
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
journal={arXiv preprint arXiv:2304.06767},
year={2023}
}
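As a hedged illustration of how this reward model could plug into rejection-sampling fine-tuning, the sketch below samples several candidate responses from a policy model and keeps the highest-reward one. The policy model choice, sampling settings, and helper names are assumptions, and rm_pipe, rm_tokenizer, and pipe_kwargs come from the Quick Start example.

from transformers import pipeline

# Hypothetical policy model, used only for illustration.
generator = pipeline("text-generation", model="google/gemma-2b-it", device=0)

prompt = "Explain what a reward model is in one paragraph."
chat = [{"role": "user", "content": prompt}]
# Build the generation prompt; strip the BOS token because the generator's tokenizer adds it again.
prompt_text = rm_tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
).replace(rm_tokenizer.bos_token, "")

# Sample several candidate completions from the policy model.
candidates = [
    out["generated_text"]
    for out in generator(
        prompt_text,
        do_sample=True,
        max_new_tokens=256,
        num_return_sequences=4,
        return_full_text=False,
    )
]

def reward(response):
    full_chat = chat + [{"role": "assistant", "content": response}]
    text = rm_tokenizer.apply_chat_template(
        full_chat, tokenize=False, add_generation_prompt=False
    ).replace(rm_tokenizer.bos_token, "")
    return rm_pipe([text], **pipe_kwargs)[0][0]["score"]

# Keep the highest-reward candidate; in rejection-sampling fine-tuning,
# such samples would form the next supervised fine-tuning set.
best_response = max(candidates, key=reward)
print(best_response)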