
RM Mistral 7B

Developed by weqweasdas
A reward model based on Mistral-7B for evaluating response quality in Reinforcement Learning from Human Feedback (RLHF) scenarios
Downloads 552
Release Time: 3/22/2024

Model Overview

This reward model is specifically designed to assess dialogue response quality and serves as a scoring module in RLHF workflows
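
As a rough illustration of how such a scoring module can be used, the sketch below loads the model through the Hugging Face Transformers sequence-classification API and scores a single dialogue turn. The repository id "weqweasdas/RM-Mistral-7B" and the chat-template formatting are assumptions inferred from the developer and model name above, not confirmed by this page.

```python
# Minimal scoring sketch; the repo id and chat formatting are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "weqweasdas/RM-Mistral-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format a single-turn conversation and read the scalar logit as the reward.
chat = [
    {"role": "user", "content": "How should I store fresh basil?"},
    {"role": "assistant",
     "content": "Keep the stems in a glass of water at room temperature."},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(**inputs).logits[0].item()
print(f"reward score: {reward:.3f}")
```

A higher score indicates a response the model judges closer to human preferences; the raw value is only meaningful relative to scores of other responses to the same prompt.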

Model Features

Multi-dataset fusion training
Trained on a fusion of six high-quality human preference datasets (HH-RLHF, SHP, UltraFeedback, etc.) after rigorous data cleaning
Fine-grained scoring capability
Supports fine-grained response quality evaluation across multiple dimensions such as helpfulness and correctness
High performance
Ranked second on the RewardBench leaderboard, demonstrating strong discriminative ability

Model Capabilities

Dialogue response quality evaluation
Human preference prediction
Reinforcement learning reward signal generation
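
To show how these capabilities translate into a preference or reward signal, the sketch below scores two candidate answers to the same question and selects the higher-scoring one; in an RLHF loop the same scalar would be passed to the policy optimizer as the reward. The repository id and the score_response helper are assumptions for illustration, not part of the model's documented API.

```python
# Preference-prediction sketch; repo id and helper are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "weqweasdas/RM-Mistral-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def score_response(question: str, answer: str) -> float:
    """Return the scalar reward the model assigns to a (question, answer) pair."""
    chat = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

question = "What does a reward model do in RLHF?"
answer_a = ("It assigns a scalar quality score to a response; the policy is "
            "then trained to produce responses that maximize this score.")
answer_b = "It is the same thing as the policy model."

# The higher-scoring candidate is the predicted human preference; in an RLHF
# loop the same scalar serves directly as the reward signal.
score_a = score_response(question, answer_a)
score_b = score_response(question, answer_b)
print("preferred:", "A" if score_a > score_b else "B")
```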

Use Cases

AI dialogue system development
RLHF training workflow
Serves as the reward model in RLHF pipelines, improving the quality and safety of dialogue system responses
Response quality monitoring
Evaluates AI assistant responses in real time, supporting manual review and system optimization