Skywork-Reward-Llama-3.1-8B Open-Source Reward Model - Strong at Handling Preferences in Complex Scenarios

Skywork Reward Llama 3.1 8B

Developed by Skywork

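An advanced reward model built on the Meta-Llama-3.1-8B-Instruct architecture that excels at handling preferences in complex scenarios

Large Language Model

Transformers

#High-quality preference evaluation #Multi-domain reward model #Efficient training on a small dataset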

Downloads 461

Release Time : 9/5/2024

Model Overview

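A high-performance reward model designed to evaluate and compare the quality of different text responses, applicable to domains such as mathematics, coding, and safety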

Model Features

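High-quality training data

Trained on 80K curated, high-quality preference pairs covering domains such as mathematics, coding, and safety

Strong performance

Ranked third on the RewardBench leaderboard as of September 2024, with strong results across the Chat, Chat Hard, Safety, and Reasoning categories

Data selection techniques

Uses targeted data selection and filtering so that performance stays balanced across domains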

Model Capabilities

Text quality evaluation

Preference scoring

Multi-domain evaluation (mathematics, coding, safety, etc.)

Handling complex scenarios

Use Cases

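AI training and optimization

Reinforcement learning training

Generates reward signals for reinforcement learning training

Helps AI models learn better response strategies

Model fine-tuning

Serves as the reward model in DPO (Direct Preference Optimization) training pipelines

Improves model performance in specific domains

Content evaluation

Response quality evaluation

Evaluates the quality of responses generated by different AI systems

Distinguishes high-quality responses from low-quality ones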
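
As a rough illustration of the use cases above, the sketch below ranks candidate responses by reward and turns the best and worst into a chosen/rejected pair of the kind preference-optimization pipelines consume. The helper names (`rank_responses`, `make_preference_pair`), the dictionary keys, and the toy scorer are assumptions made for this example; in practice `score_fn` would wrap a call to the reward model.

```python
from typing import Callable, Dict, List, Tuple

# A scorer maps (prompt, response) to a scalar reward.
# Hypothetical interface: in practice it would wrap the reward model.
ScoreFn = Callable[[str, str], float]


def rank_responses(
    prompt: str, responses: List[str], score_fn: ScoreFn
) -> List[Tuple[str, float]]:
    """Return (response, score) pairs sorted from highest to lowest reward."""
    scored = [(response, score_fn(prompt, response)) for response in responses]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def make_preference_pair(
    prompt: str, responses: List[str], score_fn: ScoreFn
) -> Dict[str, str]:
    """Build a chosen/rejected pair from the best and worst scored responses."""
    ranked = rank_responses(prompt, responses, score_fn)
    if len(ranked) < 2:
        raise ValueError("need at least two responses to form a pair")
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}


# Toy stand-in for the reward model so the example runs without a GPU.
TOY_SCORES = {"good answer": 3.0, "okay answer": 0.5, "bad answer": -2.0}


def toy_score_fn(prompt: str, response: str) -> float:
    return TOY_SCORES[response]


pair = make_preference_pair("What is 2 + 2?", list(TOY_SCORES), toy_score_fn)
print(pair)
```

Injecting the scorer as a function keeps the ranking logic independent of how the reward model is loaded or batched.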

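🚀 Skywork Reward Model Series

The Skywork Reward Model Series comprises two reward models, Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B. Both handle preferences in complex scenarios and rank near the top of the RewardBench leaderboard.

🚀 Quick Start

This section gives a brief overview of the Skywork Reward Model Series.

✨ Key Features

High performance: Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B are built on the gemma-2-27b-it and Meta-Llama-3.1-8B-Instruct architectures, respectively. As of September 2024, they rank first and third on the RewardBench leaderboard.
High-quality data: Both models are trained on the Skywork Reward Data Collection, which contains only 80K high-quality preference pairs, all sourced from publicly available data.
Multi-domain: The models handle preferences in complex scenarios across domains such as mathematics, coding, and safety.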

💻 Usage Example

Basic Usage

import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model and tokenizer
device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

# Format and tokenize the conversations
conv1_formatted = rm_tokenizer.apply_chat_template(conv1, tokenize=False)
conv2_formatted = rm_tokenizer.apply_chat_template(conv2, tokenize=False)
conv1_tokenized = rm_tokenizer(conv1_formatted, return_tensors="pt").to(device)
conv2_tokenized = rm_tokenizer(conv2_formatted, return_tensors="pt").to(device)

# Get the reward scores
with torch.no_grad():
    score1 = rm(**conv1_tokenized).logits[0][0].item()
    score2 = rm(**conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 12.625
# Score for response 2: -15.25
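
The two scores above are raw scalars and are mainly meaningful relative to each other. One common way to read such a gap is as a preference probability under a Bradley-Terry model, where the probability that response A is preferred over response B is sigmoid(score_A - score_B). Whether this model's scores are calibrated for that interpretation is an assumption here, not something the source states; the function name `preference_probability` is also made up for this sketch.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that A is preferred over B.

    Assumption: reward differences can be read as log-odds. This is an
    interpretation for illustration, not a documented property of the model.
    """
    diff = score_a - score_b
    # Numerically stable sigmoid: avoid overflow in exp() for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Scores taken from the example output above.
p = preference_probability(12.625, -15.25)
print(f"P(response 1 preferred over response 2) = {p:.6f}")
```

With a gap this large the probability is effectively 1, which matches the intuition that response 2 contains a clear arithmetic error.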

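📚 Detailed Documentation

Data Mixture

Instead of relying on existing large preference datasets, we carefully curate the Skywork Reward Data Collection (1) to include high-quality preference pairs and (2) to target specific capability and knowledge domains. The curated training dataset contains approximately 80K samples, subsampled from multiple publicly available data sources, including: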

HelpSteer2
OffsetBias
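WildGuard (adversarial)
Magpie DPO series: Ultra, Pro (Llama-3.1), Pro, Air.

Disclaimer: We made no modifications to the original datasets listed above, other than subsampling them to create the Skywork Reward Data Collection.

During dataset curation, we adopt several tricks to improve performance and balance the domains without compromising overall performance:

We select top samples in the math, code, and other categories of the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the average ArmoRM scores of the Magpie-Air and Magpie-Pro subsets, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on the three other data sources. We then (1) use this RM to score the chosen and rejected responses of all samples in WildGuard and (2) keep only the samples where the chosen response's RM score is greater than the rejected response's RM score. We observe that this approach largely preserves the original Chat, Chat Hard, and Reasoning performance while improving Safety. For both models, we use the 27B model to score the WildGuard samples.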
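
The two curation tricks above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the field names (`subset`, `category`, `armorm_score`, `chosen`, `rejected`), the per-category quota `k_per_category`, and the injected `score_fn` are all assumptions, since the source does not give the data schema or how many top samples were kept.

```python
from typing import Callable, Dict, List

# Offsets from the text: subtract 0.1 for Magpie-Air and 0.05 for Magpie-Pro.
# The subset labels used as keys are hypothetical.
SUBSET_PENALTY = {"air": 0.1, "pro": 0.05}


def adjusted_score(sample: Dict) -> float:
    """Average ArmoRM score minus the subset-specific offset."""
    return sample["armorm_score"] - SUBSET_PENALTY.get(sample["subset"], 0.0)


def select_top_per_category(
    samples: List[Dict], k_per_category: Dict[str, int]
) -> List[Dict]:
    """Pick the top-k samples independently within each category."""
    by_category: Dict[str, List[Dict]] = {}
    for sample in samples:
        by_category.setdefault(sample["category"], []).append(sample)
    selected: List[Dict] = []
    for category, group in by_category.items():
        ranked = sorted(group, key=adjusted_score, reverse=True)
        selected.extend(ranked[: k_per_category.get(category, 0)])
    return selected


def filter_consistent_pairs(
    pairs: List[Dict], score_fn: Callable[[str, str], float]
) -> List[Dict]:
    """Keep pairs where the RM scores the chosen response above the rejected one."""
    return [
        pair
        for pair in pairs
        if score_fn(pair["prompt"], pair["chosen"])
        > score_fn(pair["prompt"], pair["rejected"])
    ]


magpie = [
    {"subset": "ultra", "category": "math", "armorm_score": 0.20},
    {"subset": "air", "category": "math", "armorm_score": 0.25},
    {"subset": "pro", "category": "math", "armorm_score": 0.22},
]
top_math = select_top_per_category(magpie, {"math": 1})
print(top_math[0]["subset"])
```

In this toy data the Magpie-Air sample has the highest raw score, but after the 0.1 offset the Magpie-Ultra sample ranks first, which is the prioritization effect the text describes.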

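RewardBench Leaderboard

We evaluate our models on RewardBench using the official test script. As of September 2024, Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B rank first and third on the RewardBench leaderboard, respectively.

Rank	Model	Chat	Chat Hard	Safety	Reasoning	Score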
1	Skywork-Reward-Gemma-2-27B	95.8	91.4	92.0	96.1	93.8
2	SFR-LLaMa-3.1-70B-Judge-r	96.9	84.8	92.2	97.6	92.8
3	Skywork-Reward-Llama-3.1-8B	95.8	87.3	90.6	96.2	92.5
4	Nemotron-4-340B-Reward	95.8	87.1	92.2	93.6	92.2
5	ArmoRM-Llama3-8B-v0.1	96.9	76.8	92.2	97.3	90.8
6	SFR-nemo-12B-Judge-r	97.2	82.2	87.5	95.1	90.5
7	internlm2-20b-reward	98.9	76.5	89.9	95.8	90.3
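
As a quick sanity check on the table, the snippet below compares each reported Score with the unweighted mean of the four category columns. The numbers are copied from the rows above. That the Score agrees with this mean to within 0.1 is an observation from this table only; the source does not state how the score is defined, and small gaps may come from the leaderboard rounding unrounded category values.

```python
# (Chat, Chat Hard, Safety, Reasoning, reported Score), copied from the table.
ROWS = {
    "Skywork-Reward-Gemma-2-27B": (95.8, 91.4, 92.0, 96.1, 93.8),
    "SFR-LLaMa-3.1-70B-Judge-r": (96.9, 84.8, 92.2, 97.6, 92.8),
    "Skywork-Reward-Llama-3.1-8B": (95.8, 87.3, 90.6, 96.2, 92.5),
    "Nemotron-4-340B-Reward": (95.8, 87.1, 92.2, 93.6, 92.2),
    "ArmoRM-Llama3-8B-v0.1": (96.9, 76.8, 92.2, 97.3, 90.8),
    "SFR-nemo-12B-Judge-r": (97.2, 82.2, 87.5, 95.1, 90.5),
    "internlm2-20b-reward": (98.9, 76.5, 89.9, 95.8, 90.3),
}


def category_mean(chat: float, chat_hard: float, safety: float, reasoning: float) -> float:
    """Unweighted mean of the four category scores."""
    return (chat + chat_hard + safety + reasoning) / 4


for model, (*categories, reported) in ROWS.items():
    mean = category_mean(*categories)
    print(f"{model}: mean={mean:.3f}, reported={reported}, gap={abs(mean - reported):.3f}")
```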

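Declaration and License Agreement

Declaration

We hereby declare that the Skywork model must not be used for any activity that threatens national or societal security or for any unlawful purpose. In addition, we ask users not to deploy the Skywork model in internet services without appropriate security review and record-keeping. We hope that all users will adhere to this principle so that technological progress takes place in a regulated and lawful environment.

We have done our utmost to ensure the compliance of the data used during model training. However, despite these efforts, unpredictable risks and issues may remain because of the complexity of the model and the data. Therefore, if any problem arises from the use of the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly used, we will not assume any responsibility.

License Agreement

Community use of the Skywork model is subject to the Skywork Community License. The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by the terms and conditions of the Skywork Community License.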

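Technical Report

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Contact

If you have any questions, please feel free to reach us at yuhao.liuu@kunlun-inc.com or liang.zeng@kunlun-inc.com.

Citation

If you find our work helpful, please cite us using the following BibTeX entry: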

@article{liu2024skywork,
  title={Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs},
  author={Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui},
  journal={arXiv preprint arXiv:2410.18451},
  year={2024}
}