
FsfairX-LLaMA3-RM-v0.1

Developed by sfairXC
A reward model fine-tuned from Meta-Llama-3-8B-Instruct for use in RLHF pipelines, supporting PPO, iterative SFT, and iterative DPO methods.
Downloads 4,157
Release Time: 4/20/2024

Model Overview

This model is a reward model for Reinforcement Learning from Human Feedback (RLHF): it evaluates dialogue quality and produces scalar reward signals that guide the optimization of a language model's outputs.
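
Below is a minimal sketch of scoring a single dialogue with the model. It assumes the checkpoint is published on the Hugging Face Hub under the repo id sfairXC/FsfairX-LLaMA3-RM-v0.1 and that it exposes a standard single-score sequence-classification head; verify both against the official model card before use.

```python
# Minimal sketch: scoring one prompt/response pair with the reward model.
# The repo id and the sequence-classification head are assumptions; check the
# official model card for the recommended loading code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "sfairXC/FsfairX-LLaMA3-RM-v0.1"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format the dialogue with the Llama-3 chat template, then score it.
dialogue = [
    {"role": "user", "content": "How do I sort a list in Python?"},
    {"role": "assistant", "content": "Use the built-in sorted() function, e.g. sorted(my_list)."},
]
text = tokenizer.apply_chat_template(dialogue, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # The scalar logit is the reward signal; higher means a better-rated response.
    reward = model(**inputs).logits[0].item()
print(f"reward: {reward:.3f}")
```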

Model Features

High-performance reward modeling
Achieves strong results on the RewardBench leaderboard, making it one of the leading open-source reward models available.
Supports multiple RLHF methods
Can be used with multiple reinforcement learning from human feedback methods, including PPO, iterative SFT, and iterative DPO; see the preference-pair sketch after this list.
Based on Llama-3 architecture
Fine-tuned from the Meta-Llama-3-8B-Instruct model, inheriting its powerful language understanding capabilities.
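
As referenced above, one common way to plug the reward model into iterative DPO (or rejection-sampling-style iterative SFT) is to score several sampled responses per prompt and keep the highest- and lowest-scoring ones as a preference pair. The helper below is a hypothetical sketch; the score callable is assumed to wrap the reward model loaded in the earlier example.

```python
from typing import Callable, Dict, List, Tuple

def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> Dict[str, str]:
    """Rank sampled responses by reward; keep best/worst as chosen/rejected.

    `score` is assumed to return the reward-model score for a
    (prompt, response) pair, e.g. a wrapper around the loading sketch above.
    """
    ranked: List[Tuple[float, str]] = sorted(
        ((score(prompt, c), c) for c in candidates), reverse=True
    )
    return {
        "prompt": prompt,
        "chosen": ranked[0][1],     # highest-reward response
        "rejected": ranked[-1][1],  # lowest-reward response
    }
```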

Model Capabilities

Dialogue quality evaluation
Reward signal generation
Reinforcement learning feedback

Use Cases

Language model optimization
Reward modeling in RLHF processes
Used as a reward model in reinforcement learning from human feedback processes to guide language model optimization.
Significantly improves the dialogue quality and safety of language models.
Dialogue system evaluation
Dialogue quality scoring
Evaluates and scores the quality of responses from dialogue systems.