🚀 Skywork Critic Series Models
The Skywork Critic Series Models, developed by the SkyworkAI Alignment Team, are advanced judge models designed for pairwise preference evaluation. These models compare input pairs and provide nuanced judgments on their relative quality, offering valuable insights for data improvement, evaluation, and reward modeling.
✨ Features
- Pairwise Preference Evaluation: Excel at comparing and assessing input pairs to determine relative quality.
- Diverse Training Data: Trained on a wide range of high-quality datasets, including cleaned open-source data, in-house human annotation data, synthetic critic data, and critic-related chat data.
- High Performance on RewardBench: Skywork-Critic-Llama3.1-70B ranks first on RewardBench for generative models of all sizes, and Skywork-Critic-Llama3.1-8B tops the list for models under 10B parameters.
- Multiple Usage Scenarios: Can be used as a preference data selector and as a judge for instruction-response pairs.
📦 Installation
The usage examples below rely only on the `torch` and `transformers` Python packages, which can be installed with `pip install torch transformers`.
💻 Usage Examples
Basic Usage
Skywork Critic Model as a Preference Data Selector
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

# Feed a natural-language judging prompt to the generative critic model
prompt_template = """Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better.
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
Please directly output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.
[User Question]
{input}
[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]
[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]
"""

user_message = prompt_template.format(input=prompt, response_a=responseA, response_b=responseB)
conversation = [{"role": "user", "content": user_message}]

model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt").to(model.device)

generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
    pad_token_id=128009,
    temperature=0)

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True)
print(completion)
# Output:
# The generative model should output "[[A]]", i.e. it prefers Assistant A's (correct) response.
```
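Because the verdict is a single, machine-readable marker (`[[A]]` or `[[B]]`), the same setup can be looped over a dataset to select preferred responses. The sketch below reuses `model`, `tokenizer`, and `prompt_template` from the snippet above; the `judge_pair` helper and the `candidates` list are illustrative assumptions, not part of an official API.

```python
# A minimal preference-data-selection sketch. Assumes model, tokenizer, and
# prompt_template are already defined as in the snippet above; judge_pair and
# candidates are illustrative, not part of the released model's API.
def judge_pair(question, answer_a, answer_b):
    """Return "A" or "B" according to the critic's verdict, or None if no verdict is found."""
    message = prompt_template.format(input=question, response_a=answer_a, response_b=answer_b)
    conv = [{"role": "user", "content": message}]
    ids = tokenizer.apply_chat_template(
        conv, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(input_ids=ids, max_new_tokens=16, do_sample=False, pad_token_id=128009)
    verdict = tokenizer.decode(out[0][len(ids[0]):], skip_special_tokens=True)
    if "[[A]]" in verdict:
        return "A"
    if "[[B]]" in verdict:
        return "B"
    return None

# Build (prompt, chosen, rejected) records from candidate response pairs.
candidates = [(prompt, responseA, responseB)]  # extend with your own (question, answer_a, answer_b) triples
preference_data = []
for question, a, b in candidates:
    winner = judge_pair(question, a, b)
    if winner is not None:
        chosen, rejected = (a, b) if winner == "A" else (b, a)
        preference_data.append({"prompt": question, "chosen": chosen, "rejected": rejected})
print(preference_data)
```

Since the judging prompt already instructs the model to ignore response order, a common additional safeguard (an assumption here, not described in the card) is to judge each pair in both orders and keep only pairs with consistent verdicts.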
Skywork Critic Model as a Judge
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
# Chosen Response
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
# Rejected Response
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

# Feed a natural-language single-rating prompt to the generative critic model.
# The template below is in Chinese: it asks the model to act as a professional
# response-quality evaluator, analyze the answer along dimensions such as helpfulness,
# relevance, accuracy, depth, creativity, and safety (weighted according to the task
# type), and then give an overall score on a 1-5 scale.
single_rating_prompt_template = """请扮演一个专业的对话质量评价专家,能够从多个维度诊断和分析 AI 助手对用户问题的回答,并进行总体打分(分值范围是 1-5)。你的评估应考虑回答的有用性、相关性、准确性、深度、创造性、安全性等众多维度,请注意,不同任务类型的指令对评估分析维度的侧重不一样,需要根据具体的问题进行分析。
[用户问题]
{input}
[助手回答的开始]
{response_a}
[助手回答的结束]
你的详细评估和总体打分为: """

# Critique responseA and responseB separately.
user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
# user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)
conversation = [{"role": "user", "content": user_message}]
print(f"conversation => {conversation}")

model_name = "Skywork/Skywork-Critic-Llama3.1-70B"  # or a local checkpoint path
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt").to(model.device)

generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
    pad_token_id=128009,
    temperature=0)

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True)
print(completion)
```
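To use the critique programmatically, the 1-5 overall score can be pulled out of the free-text output. The parsing below is only a sketch: the exact output format is not guaranteed, and the `总体打分` / `Overall Score` patterns are assumptions based on the prompt template and the example judgments shown under Advanced Usage.

```python
import re

def extract_overall_score(critique: str):
    """Best-effort extraction of the 1-5 overall score from the critique text.
    The label patterns are assumptions; adjust them to the actual model output."""
    match = re.search(r"(?:总体打分|Overall Score)\D*([1-5])", critique)
    return int(match.group(1)) if match else None

print(f"overall score => {extract_overall_score(completion)}")
```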
Advanced Usage
The judgments below, generated by the model for the chosen and rejected responses in the example above, illustrate the single-rating output in practice:
Model-Generated Judgment for Chosen Response A
**Evaluation Analysis**:
1. **Usefulness**: The answer directly solves the user's problem, providing clear steps and calculation processes, which is very useful.
2. **Relevance**: The answer is completely relevant to the user's question without deviating from the topic.
3. **Accuracy**: The calculation process is accurate, and the result is correct.
4. **Depth**: The answer provides detailed steps and explanations to help the user understand the problem-solving process.
5. **Creativity**: Although the answer is a direct calculation process, simple language and clear calculations are used in explaining each step, making the answer easy to understand.
6. **Safety**: The answer does not contain any information that may cause misunderstanding or danger.
**Overall Score**: 5
**Summary**: This answer is very clear, accurate, and useful, fully meeting the user's needs. It helps the user understand the problem-solving process through step-by-step explanations and calculations. No areas for improvement are found, so a full score is given.
Model-Generated Judgment for Rejected Response B
**Evaluation Analysis**:
1. **Usefulness**: The answer directly solves the user's problem, providing the number of apples each person finally gets.
2. **Relevance**: The answer is closely relevant to the user's question without deviating from the topic.
3. **Accuracy**: The first two calculation steps are correct, but an error occurs in the last step: when Jane distributes the apples, she should distribute them among herself and her 2 siblings, a total of 3 people, not 2 people.
4. **Depth**: The answer provides detailed calculation steps to help the user understand the problem-solving process.
5. **Creativity**: The answer is straightforward without special creativity, but for this simple math problem, a direct answer is appropriate.
6. **Safety**: The answer does not contain any information that may cause misunderstanding or danger.
**Overall Score**: 4
**Improvement Suggestions**:
- In the last step, correctly calculate the number of apples each person gets. Jane should distribute 9 apples among herself and her 2 siblings, a total of 3 people. Therefore, each person should get 9 ÷ 3 = 3 apples.
📚 Documentation
Training Details
Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B are built on Meta [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) and [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) respectively. These models were fine-tuned on a diverse array of high-quality datasets:
- Cleaned open-source data: A high-quality subset of HelpSteer2, OffsetBias, WildGuard (adversarial) and the Magpie DPO series ([Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1)) is used. For more details, refer to the [Skywork-Reward-Preference-80K-v0.1 dataset](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1). Additionally, open-source, high-quality critic datasets such as [Open-Critic-GPT](https://huggingface.co/datasets/Vezora/Open-Critic-GPT) are integrated into the training process.
- In-house human annotation data: This includes both pointwise scoring of a single response across many dimensions and pairwise comparisons between two responses, with each dimension accompanied by a rationale for the assigned score. Because manually labeled data is expensive to obtain, only a few hundred such data points are available, all in Chinese, so the model's single-rating ability may not be particularly strong.
- Synthetic critic data: An approach similar to the self-taught evaluator method is used. Specifically, two methods are employed to generate inferior responses for a given instruction: 1) creating a similar instruction and then generating a response for this new instruction, and 2) introducing subtle errors into high-quality responses (see the sketch after this list).
- Critic-related chat data: This data is incorporated to maintain the model's conversational capabilities.
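As a rough illustration of the second synthetic-data method (introducing subtle errors into a high-quality response), a generation loop might look like the sketch below. The generator model, prompt wording, and helper function are assumptions for illustration only; the card does not describe the actual synthesis pipeline.

```python
from transformers import pipeline

# Illustrative only: the generator model and corruption prompt are assumptions,
# not the pipeline actually used to build the Skywork-Critic training data.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

def make_inferior_response(instruction: str, good_response: str) -> str:
    """Ask a chat model to rewrite a good response so it contains one subtle error."""
    messages = [{
        "role": "user",
        "content": (
            "Rewrite the following response so that it contains one subtle reasoning or "
            "factual error while keeping the same tone and length.\n\n"
            f"Instruction: {instruction}\n\nOriginal response: {good_response}"
        ),
    }]
    out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.8)
    return out[0]["generated_text"][-1]["content"]

# Pairing good_response (chosen) with make_inferior_response(...) (rejected)
# yields synthetic preference pairs for critic training.
```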
The training uses an instruction-tuning methodology, focusing on pairwise preference evaluation and general chat tasks. A thorough verification process was conducted to ensure the training dataset does not contain any test-set information from RewardBench, maintaining the integrity of the evaluation results.
RewardBench Leaderboard for Generative Models
The models are evaluated on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench).
As of September 2024, Skywork-Critic-Llama3.1-70B ranks first on RewardBench for generative models across all sizes, while Skywork-Critic-Llama3.1-8B tops the list for generative models under 10B parameters. (Note: An asterisk (*) indicates an open-source model.)
| Model | Chat | Chat Hard | Safety | Reasoning | Overall Score |
| --- | --- | --- | --- | --- | --- |
| Skywork-Critic-Llama3.1-70B * | 96.6 | 87.9 | 93.1 | 95.5 | 93.3 |
| Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 91.6 | 97.6 | 92.7 |
| Salesforce/SFR-nemo-12B-Judge-r | 97.2 | 82.2 | 86.5 | 95.1 | 90.3 |
| Skywork-Critic-Llama3.1-8B * | 93.6 | 81.4 | 91.1 | 89.8 | 89.0 |
| Salesforce/SFR-LLaMa-3.1-8B-Judge-r | 95.5 | 77.7 | 86.2 | 95.1 | 88.7 |
| facebook/Self-taught-Llama-3-70B | 96.9 | 84.0 | 91.1 | 82.5 | 88.6 |
| google/gemini-1.5-pro-0514 | 92.3 | 80.6 | 87.9 | 92.0 | 88.2 |
| openai/gpt-4o-2024-08-06 | 96.1 | 76.1 | 88.1 | 86.6 | 86.7 |
| openai/gpt-4-0125-preview | 95.3 | 74.3 | 87.6 | 86.9 | 86.0 |
| openai/gpt-4-turbo-2024-04-09 | 95.3 | 75.4 | 87.6 | 82.7 | 85.2 |
| Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
| meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
| NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
📄 License
Declaration
The Skywork model should not be used for any activities that pose a threat to national or societal security or engage in unlawful actions. Users are requested not to deploy the Skywork model for internet services without appropriate security reviews and records. The developers have made every effort to ensure the compliance of the data used during the model's training process. However, due to the complexity of the model and data, there may still be unpredictable risks and issues. Therefore, the developers will not assume any responsibility for problems arising from using the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized.
License Agreement
Community use of the Skywork model requires the [Skywork Community License](https://github.com/SkyworkAI/Skywork-Reward/blob/main/misc/Skywork%20Community%20License.pdf). The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by the terms and conditions of the [Skywork Community License](https://github.com/SkyworkAI/Skywork-Reward/blob/main/misc/Skywork%20Community%20License.pdf).
📞 Contact
If you have any questions or feedback, don't hesitate to reach out to the friendly team at <shiwen.tu@kunlun-inc.com> or <liang.zhao@kunlun-inc.com>. Liang Zhao leads this project.
📚 Citation
If you find the work helpful, please feel free to cite using the following BibTeX entry:
```bibtex
@misc{skyworkcritic2024,
  title={Skywork Critic Model Series},
  author={Shiwen, Tu and Liang, Zhao and Liu, Chris Yuhao and Zeng, Liang and Liu, Yang},
  year={2024},
  month={September},
  howpublished={\url{https://huggingface.co/Skywork}},
  url={https://huggingface.co/Skywork},
}
```

