Skywork-Critic-Llama-3.1-70B Open-Source Critic Model - Compare Text Pairs and Judge Their Quality and Suitability

Skywork Critic Llama 3.1 70B

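Developed by Skywork

The Skywork Critic series, developed by the SkyworkAI Alignment Team, comprises two advanced critic models, at 70B and 8B parameters. They specialize in pairwise preference evaluation: given a pair of texts, they compare the two in detail and judge their relative quality or suitability.

Large language model

PyTorch

License: Other #Pairwise preference evaluation #Reward modeling #Fine-tuning on high-quality data

Downloads: 1,413

Release date: 9/19/2024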

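Model overview

The Skywork Critic models are fine-tuned from Meta's Llama-3.1 series, with a focus on pairwise preference evaluation and general chat tasks. They are useful for data improvement, evaluation, and reward modeling.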

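Model capabilities

Text-pair quality evaluation

Preference data selection

Instruction-response pair scoring

Multi-dimensional critique and analysis

Use cases

Data improvement

DPO training data selection

Distinguishes chosen from rejected training data for Direct Preference Optimization (DPO).

Improves the quality of model training data

Model evaluation

Response quality evaluation

Scores and analyzes an AI assistant's responses along multiple dimensions.

Provides a detailed evaluation report and suggestions for improvement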

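🚀 Introduction to the Skywork Critic Model Series

The Skywork Critic series, developed by the SkyworkAI Alignment Team, comprises two advanced critic models: Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B. The models specialize in pairwise preference evaluation: given a pair of texts, they compare the two in detail and judge their relative quality or suitability. Because they model language and context well, they are useful for data improvement, evaluation, and reward modeling.

🤗 Hugging Face • 🤖 ModelScope

🚀 Quick Start

The Skywork Critic models can be used for a range of natural language processing tasks, such as data improvement, evaluation, and reward modeling. The sections below cover training details, evaluation results, usage examples, and the declaration and license agreement.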

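✨ Key Features

Pairwise preference evaluation: compares a pair of input texts in detail and judges their relative quality or suitability.
Multiple application scenarios: can be used for data improvement, evaluation, and reward modeling.
Strong performance: on the RewardBench leaderboard as of September 2024, Skywork-Critic-Llama3.1-70B ranks first among generative models of all sizes, and Skywork-Critic-Llama3.1-8B ranks first among generative models with fewer than 10B parameters.

📦 Installation

No installation steps are provided here; refer to the model's official repository.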

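💻 Usage Examples

Basic usage

The following example uses the Skywork Critic model as a preference data selector, distinguishing chosen from rejected training data for Direct Preference Optimization (DPO).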

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

# feed a natural language prompt to generative model
prompt_template = """Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user\'s instructions and answers the user\'s question better. 
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 
Please directly output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

[User Question]
{input}

[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]
"""

user_message = prompt_template.format(input=prompt, response_a=responseA, response_b=responseB)

conversation = [{"role": "user", "content": user_message}]

model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation, 
    tokenize=True, 
    add_generation_prompt=True,
    return_tensors="pt").to(model.device)

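generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    # Greedy decoding; temperature only applies when sampling, so it is not set.
    do_sample=False,
    # 128009 is the <|eot_id|> token in the Llama 3.1 tokenizer, used here for padding.
    pad_token_id=128009)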

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):], 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=True)

print(completion)

# Output:
# The generative model should output "[[A]]"
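
The completion printed above is free text, so turning it into DPO training data needs a small parsing step. The sketch below is one possible way to do that, not part of the Skywork release: `parse_verdict`, `to_dpo_pair`, `judge_both_orders`, and the `judge` callable are hypothetical names introduced for illustration. Querying both presentation orders is a common way to reduce position bias; the source does not state whether the authors do this.

```python
import re
from typing import Callable, Dict, Optional

# Matches the verdict format requested by the prompt template: [[A]] or [[B]].
VERDICT_RE = re.compile(r"\[\[(A|B)\]\]")


def parse_verdict(completion: str) -> Optional[str]:
    """Return 'A' or 'B' if the completion names exactly one winner, else None."""
    found = set(VERDICT_RE.findall(completion))
    return found.pop() if len(found) == 1 else None


def to_dpo_pair(prompt: str, response_a: str, response_b: str,
                verdict: Optional[str]) -> Optional[Dict[str, str]]:
    """Map a verdict onto a chosen/rejected record; skip undecided cases."""
    if verdict == "A":
        chosen, rejected = response_a, response_b
    elif verdict == "B":
        chosen, rejected = response_b, response_a
    else:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def judge_both_orders(judge: Callable[[str, str], str],
                      response_a: str, response_b: str) -> Optional[str]:
    """Ask the judge in both orders and keep the verdict only if they agree.

    `judge(first, second)` is assumed to return the raw model completion for
    the prompt built with `first` as assistant A and `second` as assistant B.
    """
    forward = parse_verdict(judge(response_a, response_b))
    backward = parse_verdict(judge(response_b, response_a))
    # In the swapped call, "[[A]]" refers to the original response_b.
    backward_unswapped = {"A": "B", "B": "A"}.get(backward)
    if forward is not None and forward == backward_unswapped:
        return forward
    return None
```

In practice `judge` would wrap the `apply_chat_template` and `generate` calls from the example above; pairs for which `judge_both_orders` returns `None` can simply be dropped from the DPO set.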

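Advanced usage

The following example uses the Skywork Critic model as a judge that generates a score and a rationale for an instruction-response pair.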

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"

# Chosen Response
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."

# Rejected Response
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

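# Feed a natural-language prompt to the generative model.
# The template is kept in Chinese because the manually annotated single-rating
# training data is entirely Chinese. English gloss: act as a professional
# dialogue-quality evaluator, analyze the assistant's answer along multiple
# dimensions (helpfulness, relevance, accuracy, depth, creativity, safety, and
# so on, weighted by task type), and give an overall score from 1 to 5.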
single_rating_prompt_template = """请扮演一个专业的对话质量评价专家，能够从多个维度诊断和分析 AI 助手对用户问题的回答，并进行总体打分（分值范围是 1-5）。你的评估应考虑回答的有用性、相关性、准确性、深度、创造性、安全性等众多维度，请注意，不同任务类型的指令对评估分析维度的侧重不一样，需要根据具体的问题进行分析。

[用户问题]
{input}

[助手回答的开始]
{response_a}
[助手回答的结束]

你的详细评估和总体打分为: """

# Critique responseA and responseB separately (swap the commented line to rate responseB).
user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
# user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)


conversation = [{"role": "user", "content": user_message}]

print(f"conversation => {conversation}")


model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation, 
    tokenize=True, 
    add_generation_prompt=True, 
    return_tensors="pt").to(model.device)

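generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    # Greedy decoding; temperature only applies when sampling, so it is not set.
    do_sample=False,
    # 128009 is the <|eot_id|> token in the Llama 3.1 tokenizer, used here for padding.
    pad_token_id=128009)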

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):], 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=True)


print(completion)

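Critique generated by the model for the chosen response A (translated from the Chinese output)

**Evaluation analysis:**

1. **Helpfulness**: The answer directly addresses the user's question and provides clear steps and calculations, which is very helpful.
2. **Relevance**: The answer is fully relevant to the user's question and does not drift off topic.
3. **Accuracy**: The calculations are accurate and the result is correct.
4. **Depth**: The answer provides detailed steps and explanations that help the user understand how the problem is solved.
5. **Creativity**: Although the answer is a straightforward calculation, it explains each step in simple language with explicit arithmetic, which makes it easy to follow.
6. **Safety**: The answer contains no information that could be misleading or dangerous.

**Overall score: 5**

**Summary**: This answer is very clear, accurate, and helpful, and it fully meets the user's needs. It walks through the explanation and calculation step by step, helping the user understand how the problem is solved. Nothing needs improvement, so it receives full marks.

Critique generated by the model for the rejected response B (translated from the Chinese output)

**Evaluation analysis:**

1. **Helpfulness**: The answer directly addresses the user's question and states how many apples each person ends up with.
2. **Relevance**: The answer is closely relevant to the user's question and does not drift off topic.
3. **Accuracy**: The calculation in the answer is correct, but an error occurs in the last step. When Jane splits the apples, she should share them among herself and her 2 siblings, 3 people in total, not 2 people.
4. **Depth**: The answer provides detailed calculation steps that help the user understand how the problem is solved.
5. **Creativity**: The answer is direct and not particularly creative, but a direct answer is appropriate for this simple math problem.
6. **Safety**: The answer contains no information that could be misleading or dangerous.

**Overall score:** 4

**Suggestions for improvement:**
- In the last step, compute the number of apples each person gets correctly. Jane should split the 9 apples among herself and her 2 siblings, 3 people in total, so each person gets 9 ÷ 3 = 3 apples.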
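
To use these critiques programmatically, the overall score has to be pulled out of the generated text. The sketch below is a hypothetical helper, not part of the Skywork release. It assumes the raw completion is in Chinese (the prompt template is Chinese) and that the score follows the label `总体打分` ("overall score"), with Markdown bold markers placed either around the whole line or around the label only, as in the two sample outputs. Other layouts are not handled.

```python
import re
from typing import Optional

# Label, then an ASCII or full-width colon, optional bold markers, then a 1-5 digit.
SCORE_RE = re.compile(r"总体打分\s*[:：]\s*\**\s*([1-5])(?!\d)")


def parse_overall_score(critique: str) -> Optional[int]:
    """Return the 1-5 overall score from a critique, or None if none is found."""
    match = SCORE_RE.search(critique)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(parse_overall_score("**总体打分：5**"))
    print(parse_overall_score("**总体打分：** 4"))
```

Returning `None` rather than raising lets a batch job count and inspect the critiques that did not follow the expected format.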

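📚 Documentation

Training details

Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B are built on Meta's Llama-3.1-70B-Instruct and Llama-3.1-8B-Instruct, respectively. They were fine-tuned on several high-quality datasets:

Cleaned open-source data: high-quality subsets of HelpSteer2, OffsetBias, WildGuard (adversarial), and the Magpie DPO series (Ultra, Pro (Llama-3.1), Pro, Air). For details, see the Skywork-Reward-Preference-80K-v0.1 dataset. Some open-source, high-quality critic datasets, such as Open-Critic-GPT, were also integrated into training.
In-house human annotation data: pointwise scoring of a single response along multiple dimensions, and pairwise comparisons between two responses. Each dimension includes a rationale for the score. Manually annotated data is very expensive to obtain: there are only a few hundred manually annotated data points, all in Chinese, so the models' single-rating ability may not be particularly strong.
Synthetic critic data: generated with an approach similar to self-taught. Two methods were used to produce worse responses for a given instruction: 1) create a similar instruction and generate a response to that new instruction; 2) introduce subtle errors into a high-quality response.
Critic-related chat data: included to preserve the models' conversational ability.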
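
The two synthetic-data strategies in the list above can be pictured as two ways of producing a "rejected" response for a known good one. The sketch below only illustrates that structure: the source does not publish its generation pipeline, and every name here (`build_synthetic_pair`, `answer_similar_instruction`, `inject_error`, and the callables they take) is hypothetical. In a real pipeline the callables would be LLM calls; here they are plain functions so the example runs on its own.

```python
from typing import Callable, Dict

# A degrader takes (instruction, good_response) and returns a worse response.
Degrader = Callable[[str, str], str]


def build_synthetic_pair(instruction: str, good_response: str,
                         make_worse: Degrader) -> Dict[str, str]:
    """Pair a good response with a degraded one for the same instruction."""
    rejected = make_worse(instruction, good_response)
    if rejected.strip() == good_response.strip():
        raise ValueError("degraded response is identical to the original")
    return {"prompt": instruction, "chosen": good_response, "rejected": rejected}


def answer_similar_instruction(rewrite: Callable[[str], str],
                               generate: Callable[[str], str]) -> Degrader:
    """Strategy 1: answer a similar but different instruction."""
    def make_worse(instruction: str, good_response: str) -> str:
        return generate(rewrite(instruction))
    return make_worse


def inject_error(corrupt: Callable[[str], str]) -> Degrader:
    """Strategy 2: introduce a subtle error into the good response."""
    def make_worse(instruction: str, good_response: str) -> str:
        return corrupt(good_response)
    return make_worse


if __name__ == "__main__":
    # Toy stand-in for an LLM that corrupts one arithmetic step.
    degrade = inject_error(lambda text: text.replace("9 / 3 = 3", "9 / 2 = 4.5"))
    pair = build_synthetic_pair("Split 9 apples among 3 people.",
                                "9 / 3 = 3 apples each.", degrade)
    print(pair["rejected"])
```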

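Training uses instruction tuning, focused on pairwise preference evaluation and general chat tasks. A thorough verification process was carried out to ensure that the training datasets contain no test-set information from RewardBench, preserving the integrity of the evaluation results.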
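
The source does not describe how the contamination check was performed. As a rough illustration of what such a check can look like, the sketch below drops training rows whose prompt matches a benchmark prompt after light normalization. The function names and the `"prompt"` field are assumptions made for this example, and exact matching after normalization is a deliberately simple stand-in for whatever procedure the authors used.

```python
import re
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def drop_contaminated(train_rows: Iterable[Dict[str, str]],
                      test_prompts: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only training rows whose normalized prompt is absent from the test set."""
    blocked = {normalize(prompt) for prompt in test_prompts}
    return [row for row in train_rows if normalize(row["prompt"]) not in blocked]


if __name__ == "__main__":
    train = [{"prompt": "What is 2+2?"}, {"prompt": "Name a color."}]
    benchmark = ["what is  2+2 ?"]
    print(drop_contaminated(train, benchmark))
```

Exact matching misses paraphrased overlaps; n-gram or embedding-based overlap checks are stricter alternatives.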

RewardBench leaderboard for generative models

The models were evaluated on RewardBench with the official test script. As of September 2024, Skywork-Critic-Llama3.1-70B ranks first among generative models of all sizes, and Skywork-Critic-Llama3.1-8B ranks first among generative models with fewer than 10B parameters. (Note: an asterisk (*) marks an open-source model.)

Model	Chat	Chat Hard	Safety	Reasoning	Overall Score
Skywork-Critic-Llama3.1-70B *	96.6	87.9	93.1	95.5	93.3
Salesforce/SFR-LLaMa-3.1-70B-Judge-r	96.9	84.8	91.6	97.6	92.7
Salesforce/SFR-nemo-12B-Judge-r	97.2	82.2	86.5	95.1	90.3
Skywork-Critic-Llama3.1-8B *	93.6	81.4	91.1	89.8	89.0
Salesforce/SFR-LLaMa-3.1-8B-Judge-r	95.5	77.7	86.2	95.1	88.7
facebook/Self-taught-Llama-3-70B	96.9	84.0	91.1	82.5	88.6
google/gemini-1.5-pro-0514	92.3	80.6	87.9	92.0	88.2
openai/gpt-4o-2024-08-06	96.1	76.1	88.1	86.6	86.7
openai/gpt-4-0125-preview	95.3	74.3	87.6	86.9	86.0
openai/gpt-4-turbo-2024-04-09	95.3	75.4	87.6	82.7	85.2
Anthropic/claude-3-5-sonnet-20240620	96.4	74.0	81.6	84.7	84.2
meta-llama/Meta-Llama-3.1-70B-Instruct *	97.2	70.2	82.8	86.0	84.0
NCSOFT/Llama-3-OffsetBias-8B *	92.5	80.3	86.8	76.4	84.0
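
The overall column in this table is consistent with an unweighted mean of the four category scores, up to rounding of the published one-decimal figures. The short check below recomputes it for three rows copied from the table; that the leaderboard averages exactly this way is an inference from the numbers, not something the source states.

```python
from statistics import mean

# (Chat, Chat Hard, Safety, Reasoning), reported overall score; copied from the table.
rows = {
    "Skywork-Critic-Llama3.1-70B": ([96.6, 87.9, 93.1, 95.5], 93.3),
    "Skywork-Critic-Llama3.1-8B": ([93.6, 81.4, 91.1, 89.8], 89.0),
    "openai/gpt-4o-2024-08-06": ([96.1, 76.1, 88.1, 86.6], 86.7),
}


def overall(category_scores):
    """Unweighted mean of the four category scores."""
    return mean(category_scores)


for name, (scores, reported) in rows.items():
    print(f"{name}: mean={overall(scores):.3f} reported={reported}")
```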

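🔧 Technical Details

The models are fine-tuned from Meta's Llama-3.1 series using instruction tuning, with a focus on pairwise preference evaluation and general chat tasks. Training used several high-quality datasets: cleaned open-source data, in-house human annotation data, synthetic critic data, and critic-related chat data. A thorough verification process ensured that the training datasets contain no test-set information from RewardBench, preserving the integrity of the evaluation results.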

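📄 License

Declaration

We hereby declare that the Skywork models must not be used for any activity that threatens national or social security, or for any unlawful purpose. We also ask users not to deploy the Skywork models in internet services without appropriate security reviews and records. We hope all users will follow this principle so that technological progress takes place in a regulated and lawful environment.

We have done our best to ensure the compliance of the data used in model training. Even so, because of the complexity of the models and data, unpredictable risks and problems may remain. We therefore accept no responsibility for any problem arising from use of the Skywork open-source models, including but not limited to data security issues, public opinion risks, or any risk or problem caused by the models being misled, abused, disseminated, or improperly used.

License agreement

Community use of the Skywork models is subject to the Skywork Community License. The Skywork models support commercial use. If you plan to use the Skywork models or their derivatives for commercial purposes, you must comply with the terms and conditions of the Skywork Community License.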

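📞 Contact

If you have any questions or feedback, please contact us at shiwen.tu@kunlun-inc.com or liang.zhao@kunlun-inc.com. The project is led by Liang Zhao.

📚 Citation

If you find our work helpful, please cite us with the following BibTeX entry: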

@misc{skyworkcritic2024,
  title={Skywork Critic Model Series},
  author={Tu, Shiwen and Zhao, Liang and Liu, Chris Yuhao and Zeng, Liang and Liu, Yang},
  year={2024},
  month={September},
  howpublished={\url{https://huggingface.co/Skywork}},
  url={https://huggingface.co/Skywork},
}