Skywork-Reward-Llama-3.1-8B: an open-source reward model for preference judgments in complex scenarios

Skywork Reward Llama 3.1 8B

Developed by Skywork

A reward model built on the Meta-Llama-3.1-8B-Instruct architecture, designed to handle preferences in complex scenarios

Large language model

Transformers

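#High-quality preference evaluation #Multi-domain reward model #Efficient training on a small dataset

Downloads: 461

Released: 9/5/2024

Model overview

A high-performance reward model for scoring and comparing the quality of text responses across domains such as mathematics, coding, and safety

Model features

High-quality training data

Trained on a curated set of about 80K high-quality preference pairs covering mathematics, coding, safety, and other domains

Strong benchmark performance

Ranked third on the RewardBench leaderboard as of September 2024, with strong results in the Chat, Chat Hard, Safety, and Reasoning categories

Data selection tricks

Uses data selection and filtering techniques to keep performance balanced across domains

Model capabilities

Text quality evaluation

Preference scoring

Multi-domain evaluation (mathematics, coding, safety, and more)

Handling complex scenarios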

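Use cases

AI training and optimization

Reinforcement learning

Generates the reward signal during reinforcement learning training

Helps a model learn a better response policy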

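Model fine-tuning

Scores candidate responses to build preference pairs for DPO (Direct Preference Optimization) training

Improves model performance in specific domains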

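Content evaluation

Response quality evaluation

Evaluates the quality of responses generated by different AI systems

Distinguishes high-quality responses from low-quality ones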
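
One way the use cases above might fit together is to score several candidate responses per prompt and keep the highest- and lowest-scoring ones as a chosen/rejected pair. The sketch below is illustrative only: `build_preference_pair`, `toy_score`, and the `min_margin` threshold are hypothetical names introduced here, and `toy_score` is a stand-in that would be replaced by a real reward model call.

```python
from typing import Callable, Optional


def build_preference_pair(
    prompt: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> Optional[dict]:
    """Return a chosen/rejected pair, or None if the scores are too close."""
    if len(candidates) < 2:
        return None
    # Score every candidate and sort from highest to lowest reward.
    scored = sorted(
        ((score_fn(prompt, c), c) for c in candidates),
        key=lambda t: t[0],
        reverse=True,
    )
    (best_score, best), (worst_score, worst) = scored[0], scored[-1]
    margin = best_score - worst_score
    # Skip prompts where the scorer cannot separate the candidates.
    if margin <= min_margin:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst, "margin": margin}


def toy_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in scorer (NOT a reward model).
    return 1.0 if "3 apples" in response else -1.0


pair = build_preference_pair(
    "How many apples does each person get?",
    ["Each person gets 4 apples.", "Each person gets 3 apples."],
    toy_score,
)
print(pair)
```

Taking only `scored[0]` gives best-of-N selection; the `min_margin` filter is a design choice for dropping pairs the scorer cannot clearly separate.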

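🚀 Skywork Reward Model Series

The Skywork Reward Model Series comprises two reward models, Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B. Both handle preferences in complex scenarios and rank near the top of the RewardBench leaderboard.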

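✨ Key features

High performance: Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B are built on the gemma-2-27b-it and Meta-Llama-3.1-8B-Instruct architectures, respectively. As of September 2024, they rank first and third on the RewardBench leaderboard.
High-quality data: both models are trained on the Skywork Reward Data Collection, which contains only 80K high-quality preference pairs, all sourced from publicly available data.
Multi-domain coverage: the models handle preferences in complex scenarios across domains such as mathematics, coding, and safety.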

💻 Usage example

Basic usage

import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model and tokenizer
device = "cuda:0"
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
rm = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
    num_labels=1,
)
rm_tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

# Format and tokenize the conversations
conv1_formatted = rm_tokenizer.apply_chat_template(conv1, tokenize=False)
conv2_formatted = rm_tokenizer.apply_chat_template(conv2, tokenize=False)
conv1_tokenized = rm_tokenizer(conv1_formatted, return_tensors="pt").to(device)
conv2_tokenized = rm_tokenizer(conv2_formatted, return_tensors="pt").to(device)

# Get the reward scores
with torch.no_grad():
    score1 = rm(**conv1_tokenized).logits[0][0].item()
    score2 = rm(**conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 12.625
# Score for response 2: -15.25
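
The two raw scores are easiest to read as a difference. As a minimal sketch, assuming the scores are interpreted under a Bradley-Terry preference model (an assumption made here; the text above does not state the training objective), the gap can be mapped to a preference probability with a sigmoid. `preference_probability` is a hypothetical helper name.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """P(A preferred over B) as sigmoid(score_a - score_b)."""
    diff = score_a - score_b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Scores taken from the example output above.
p = preference_probability(12.625, -15.25)
print(f"P(response 1 preferred over response 2) = {p:.6f}")
```

With a gap this large the probability is very close to 1, which matches the fact that response 2 contains an arithmetic error.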

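📚 Detailed documentation

Data mixture

Instead of relying on existing large preference datasets, we carefully curated the Skywork Reward Data Collection to (1) include high-quality preference pairs and (2) target specific capability and knowledge domains. The curated training set contains approximately 80K samples, subsampled from multiple publicly available data sources, including:

HelpSteer2
OffsetBias
WildGuard (adversarial)
Magpie DPO series: Ultra, Pro (Llama-3.1), Pro, Air.

Disclaimer: we made no modifications to the original datasets listed above, other than subsampling them to create the Skywork Reward Data Collection.

During dataset curation, we used several tricks to improve performance and balance the domains without compromising overall performance:

We select the top samples from the math, code, and other categories of the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the average ArmoRM scores in the Magpie-Air and Magpie-Pro subsets, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on the other three data sources. We then (1) use this RM to score the chosen and rejected responses for all samples in WildGuard and (2) keep only the samples where the chosen response's RM score is greater than the rejected response's RM score. We observe that this approach largely preserves the original Chat, Chat Hard, and Reasoning performance while improving Safety. For both models, we use the 27B model to score the WildGuard samples.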
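
The two curation tricks above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `MagpieSample` fields, the subset labels, the per-category quotas, and the `score_fn` signature are all hypothetical. Only the 0.1 and 0.05 offsets and the "chosen must outscore rejected" rule come from the text.

```python
from dataclasses import dataclass
from typing import Callable

# Offsets from the text: Magpie-Air minus 0.1, Magpie-Pro minus 0.05.
# The subset labels used as keys are hypothetical.
SUBSET_OFFSET = {"air": -0.1, "pro": -0.05}


@dataclass
class MagpieSample:
    subset: str          # e.g. "ultra", "pro-llama-3.1", "pro", "air"
    category: str        # "math", "code", or "other"
    armorm_score: float  # average ArmoRM score shipped with the dataset


def select_top_magpie(samples, k_per_category):
    """Pick the top-k samples per category by offset-adjusted ArmoRM score."""
    by_category = {}
    for s in samples:
        adjusted = s.armorm_score + SUBSET_OFFSET.get(s.subset, 0.0)
        by_category.setdefault(s.category, []).append((adjusted, s))
    selected = []
    for category, scored in by_category.items():
        scored.sort(key=lambda t: t[0], reverse=True)
        selected.extend(s for _, s in scored[: k_per_category.get(category, 0)])
    return selected


def filter_wildguard(pairs, score_fn: Callable[[str, str], float]):
    """Keep only pairs whose chosen response outscores the rejected one."""
    return [
        p for p in pairs
        if score_fn(p["prompt"], p["chosen"]) > score_fn(p["prompt"], p["rejected"])
    ]


samples = [
    MagpieSample("air", "math", 0.25),
    MagpieSample("ultra", "math", 0.20),
    MagpieSample("pro", "code", 0.30),
    MagpieSample("pro-llama-3.1", "code", 0.27),
]
picked = select_top_magpie(samples, {"math": 1, "code": 1})
print([s.subset for s in picked])
```

In this toy data the Air and Pro samples have the higher raw scores, but after the offsets the Ultra and Pro-Llama-3.1 samples are selected, which is the prioritization the text describes.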

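RewardBench leaderboard

We evaluated our models on RewardBench using the official test script. As of September 2024, Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B rank first and third on the RewardBench leaderboard.

Rank	Model	Chat	Chat Hard	Safety	Reasoning	Score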
1	Skywork-Reward-Gemma-2-27B	95.8	91.4	92.0	96.1	93.8
2	SFR-LLaMa-3.1-70B-Judge-r	96.9	84.8	92.2	97.6	92.8
3	Skywork-Reward-Llama-3.1-8B	95.8	87.3	90.6	96.2	92.5
4	Nemotron-4-340B-Reward	95.8	87.1	92.2	93.6	92.2
5	ArmoRM-Llama3-8B-v0.1	96.9	76.8	92.2	97.3	90.8
6	SFR-nemo-12B-Judge-r	97.2	82.2	87.5	95.1	90.5
7	internlm2-20b-reward	98.9	76.5	89.9	95.8	90.3

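Declaration and license agreement

Declaration

We hereby declare that the Skywork models must not be used for any activity that threatens national or societal security, or for any unlawful purpose. We also ask users not to deploy the Skywork models in internet services without appropriate security review and record-keeping. We hope all users will follow this principle so that technological progress takes place in a regulated and lawful environment.

We have done our utmost to ensure the compliance of the data used in training. Despite these efforts, the complexity of the model and data means that unpredictable risks and issues may remain. Therefore, we assume no responsibility for any problem arising from use of the Skywork open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems caused by the model being misled, abused, disseminated, or improperly used.

License agreement

Community use of the Skywork models is governed by the Skywork Community License. The Skywork models support commercial use. If you plan to use the Skywork models or their derivatives for commercial purposes, you must comply with the terms and conditions of the Skywork Community License.

Technical report

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Contact

If you have any questions, please contact us at yuhao.liuu@kunlun-inc.com or liang.zeng@kunlun-inc.com.

Citation

If you find our work helpful, please cite us using the following BibTeX entry: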

@article{liu2024skywork,
  title={Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs},
  author={Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui},
  journal={arXiv preprint arXiv:2410.18451},
  year={2024}
}