Skywork-Critic-Llama-3.1-70B: a free, open-source critic model for comparing text pairs and assessing their quality and suitability

Skywork Critic Llama 3.1 70B

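Developed by Skywork

The Skywork Critic series, developed by the Skywork AI alignment team, consists of two critic models, 70B and 8B. They specialize in pairwise preference evaluation: given a pair of texts, they compare them in detail and judge their relative quality or suitability.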

Large language model

PyTorch

License: Other #Pairwise preference evaluation #Reward modeling #Fine-tuning on high-quality data

Downloads: 1,413

Released: 9/19/2024

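Model Overview

The Skywork Critic models are fine-tuned from Meta's Llama-3.1 series with a focus on pairwise preference evaluation and general chat tasks. They are intended for data improvement, evaluation, and reward modeling.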

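Model Features

Pairwise preference evaluation

Compares a pair of input texts in detail and judges their relative quality or suitability.

Multiple application scenarios

Can be used for data improvement, evaluation, and reward modeling.

Strong benchmark performance

On the RewardBench leaderboard, the 70B version ranked first among generative models of all sizes as of September 2024.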

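Model Capabilities

Text-pair quality evaluation

Preference data selection

Scoring of instruction-response pairs

Multi-dimensional critique and analysis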

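Use Cases

Data improvement

DPO training data selection

Distinguishes chosen from rejected training data for Direct Preference Optimization (DPO).

Improves the quality of model training data

Model evaluation

Response quality evaluation

Scores and analyzes an AI assistant's responses along multiple dimensions.

Provides a detailed evaluation report and suggestions for improvement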

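🚀 Introduction to the Skywork Critic Model Series

The Skywork Critic series, developed by the Skywork AI alignment team, consists of two critic models: Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B. These models specialize in pairwise preference evaluation: given a pair of texts, they compare them in detail and judge their relative quality or suitability. Drawing on a deep understanding of language and context, the Skywork Critic models are useful for data improvement, evaluation, and reward modeling.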

🤗 Hugging Face • 🤖 ModelScope

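🚀 Quick Start

The Skywork Critic models can be applied to natural language processing tasks such as data improvement, evaluation, and reward modeling. The sections below cover training details, evaluation results, usage examples, and the declaration and license agreement.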

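✨ Key Features

Pairwise preference evaluation: compares a pair of input texts in detail and judges their relative quality or suitability.
Multiple application scenarios: can be used for data improvement, evaluation, and reward modeling.
Strong benchmark performance: on the RewardBench leaderboard, Skywork-Critic-Llama3.1-70B ranked first among generative models of all sizes, and Skywork-Critic-Llama3.1-8B ranked first among generative models with fewer than 10B parameters.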

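📦 Installation

No installation steps are provided yet; refer to the model's official repository. The usage examples below import torch and transformers, so both packages need to be available.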

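💻 Usage Examples

Basic usage

The following example uses the Skywork Critic model as a preference data selector, distinguishing chosen from rejected training data for Direct Preference Optimization (DPO).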

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

# feed a natural language prompt to generative model
prompt_template = """Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user\'s instructions and answers the user\'s question better. 
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 
Please directly output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

[User Question]
{input}

[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]
"""

user_message = prompt_template.format(input=prompt, response_a=responseA, response_b=responseB)

conversation = [{"role": "user", "content": user_message}]

model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation, 
    tokenize=True, 
    add_generation_prompt=True,
    return_tensors="pt").to(model.device)

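# Greedy decoding: do_sample=False already makes generation deterministic,
# so the redundant temperature=0 argument is dropped.
generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
    pad_token_id=128009)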

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):], 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=True)

print(completion)

# Output:
# The generative model should output "[[A]]"
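To use the verdict for DPO data selection, the "[[A]]" or "[[B]]" marker still has to be pulled out of the completion and mapped to a chosen/rejected pair. The following is a minimal sketch of one way to do that, not part of the original example. The helper names parse_verdict and to_dpo_record are hypothetical, and judging each pair twice with the response order swapped is an assumption added here because the prompt asks the judge to avoid position bias.

```python
import re
from typing import Dict, Optional

# The judge prompt asks for a final verdict of the form "[[A]]" or "[[B]]".
VERDICT_PATTERN = re.compile(r"\[\[([AB])\]\]")


def parse_verdict(completion: str) -> Optional[str]:
    """Return "A" or "B" if the completion names exactly one winner, else None."""
    verdicts = set(VERDICT_PATTERN.findall(completion))
    return verdicts.pop() if len(verdicts) == 1 else None


def to_dpo_record(
    prompt: str,
    response_a: str,
    response_b: str,
    verdict_ab: Optional[str],
    verdict_ba: Optional[str],
) -> Optional[Dict[str, str]]:
    """Build a chosen/rejected record from two judgments of the same pair.

    verdict_ab: verdict when response_a was shown as Assistant A.
    verdict_ba: verdict when the order was swapped (response_b shown as Assistant A).
    Returns None if a verdict is missing or the two orderings disagree.
    """
    if verdict_ab is None or verdict_ba is None:
        return None
    # Map the swapped verdict back onto the original labelling.
    unswapped = "A" if verdict_ba == "B" else "B"
    if verdict_ab != unswapped:
        return None  # the winner changed with the ordering, so skip the pair
    if verdict_ab == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In this sketch, each pair would be run through the model twice, once per ordering, and both completions passed through parse_verdict before calling to_dpo_record. Pairs that return None are dropped rather than guessed.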

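Advanced usage

The following example uses the Skywork Critic model as a judge that generates a score and a rationale for an instruction-response pair.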

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An Example Case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"

# Chosen Response
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."

# Rejected Response
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

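# Feed a natural-language prompt to the generative model.
# The rating prompt is kept in Chinese, as in the source. It asks the model to act as a
# dialogue-quality expert, analyze the answer along several dimensions (helpfulness,
# relevance, accuracy, depth, creativity, safety), and give an overall score from 1 to 5.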
single_rating_prompt_template = """請扮演一個專業的對話質量評價專家，能夠從多個維度診斷和分析 AI 助手對用戶問題的回答，並進行總體打分（分值範圍是 1-5）。你的評估應考慮回答的有用性、相關性、準確性、深度、創造性、安全性等眾多維度，請注意，不同任務類型的指令對評估分析維度的側重不一樣，需要根據具體的問題進行分析。

[用戶問題]
{input}

[助手回答的開始]
{response_a}
[助手回答的結束]

你的詳細評估和總體打分為: """

# Critique responseA and responseB separately (swap in the commented line below to rate responseB).
user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
# user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)


conversation = [{"role": "user", "content": user_message}]

print(f"conversation => {conversation}")


model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation, 
    tokenize=True, 
    add_generation_prompt=True, 
    return_tensors="pt").to(model.device)

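# Greedy decoding: do_sample=False already makes generation deterministic,
# so the redundant temperature=0 argument is dropped.
generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
    pad_token_id=128009)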

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):], 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=True)


print(completion)

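Model-generated critique of the chosen response A (translated from the model's Chinese output)

**Evaluation analysis:**

1. **Helpfulness**: The answer directly addresses the user's question and provides clear steps and calculations, which makes it very helpful.
2. **Relevance**: The answer is fully relevant to the user's question and stays on topic.
3. **Accuracy**: The calculations are correct and the result is right.
4. **Depth**: The answer provides detailed steps and explanations that help the user understand how the problem is solved.
5. **Creativity**: Although the answer is a straightforward calculation, it explains each step in simple language with explicit arithmetic, which makes it easy to follow.
6. **Safety**: The answer contains nothing that could be misleading or dangerous.

**Overall score: 5**

**Summary**: The answer is very clear, accurate, and helpful, and it fully meets the user's needs. Its step-by-step explanation and calculation help the user understand how the problem is solved. Nothing needs improvement, so it receives full marks.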

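Model-generated critique of the rejected response B (translated from the model's Chinese output)

**Evaluation analysis:**

1. **Helpfulness**: The answer directly addresses the user's question and states how many apples each person ends up with.
2. **Relevance**: The answer is closely related to the user's question and stays on topic.
3. **Accuracy**: The calculations are correct up to the last step, which contains an error. When Jane splits the apples, she should divide them among herself and her 2 siblings, 3 people in total, not 2.
4. **Depth**: The answer provides detailed calculation steps that help the user understand how the problem is solved.
5. **Creativity**: The answer is direct and not especially creative, but a direct answer is appropriate for this simple math problem.
6. **Safety**: The answer contains nothing that could be misleading or dangerous.

**Overall score:** 4

**Suggestions for improvement:**
- Compute the number of apples per person correctly in the last step. Jane should split the 9 apples among herself and her 2 siblings, 3 people in total, so each person gets 9 ÷ 3 = 3 apples.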
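The two critiques above format the overall score slightly differently (the bold markers close after the digit in one and before it in the other), so downstream code that consumes the score needs a tolerant parser. The sketch below is an illustration under stated assumptions rather than part of the original example: the function name parse_overall_score is hypothetical, and the pattern only covers the label variants seen above (the Chinese label 總體打分, its Simplified form, and an English "Overall score").

```python
import re
from typing import Optional

# Accepts "**總體打分：5**", "**總體打分：** 4", and "Overall score: 4".
# The label may be followed by a full-width or ASCII colon and optional bold markers.
SCORE_PATTERN = re.compile(
    r"(?:總體打分|总体打分|Overall score)\s*[:：]\s*\**\s*([1-5])(?!\d)",
    re.IGNORECASE,
)


def parse_overall_score(critique: str) -> Optional[int]:
    """Return the 1-5 overall score found in a critique, or None if absent."""
    match = SCORE_PATTERN.search(critique)
    return int(match.group(1)) if match else None
```

Returning None instead of a default score keeps unparseable critiques visible, which matters here because the training details note that single-rating ability may be limited.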

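📚 Detailed Documentation

Training details

Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B are built on Meta's Llama-3.1-70B-Instruct and Llama-3.1-8B-Instruct, respectively. The models were fine-tuned on a range of high-quality datasets, including:

Cleaned open-source data: high-quality subsets of HelpSteer2, OffsetBias, WildGuard (adversarial), and the Magpie DPO series (Ultra, Pro (Llama-3.1), Pro, Air). For more details, see the Skywork-Reward-Preference-80K-v0.1 dataset. In addition, some open-source, high-quality critic datasets such as Open-Critic-GPT were integrated into training.
In-house human annotation data: pointwise scoring of a single response across multiple dimensions, and pairwise comparisons between two responses. Each dimension includes a rationale for the assigned score. Note that manually labeled data is very expensive to obtain: there are only a few hundred manually labeled data points, all in Chinese, so the models' single-rating ability may not be particularly strong.
Synthetic critic data: a method similar to self-taught was used. Specifically, two approaches generate an inferior response to a given instruction: 1) create a similar instruction and generate a response to that new instruction; 2) introduce subtle errors into a high-quality response.
Critic-related chat data: included to preserve the models' conversational ability.

Training uses instruction tuning and focuses on pairwise preference evaluation and general chat tasks. A thorough verification process was carried out to ensure that the training dataset contains no information from any RewardBench test set, preserving the integrity of the evaluation results.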
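The source does not say how the check against RewardBench test data was performed. As an illustration only, the sketch below shows one common approach, flagging training prompts that share a long word n-gram with any test prompt. The function names, the 8-gram window, and the whitespace tokenization are all assumptions made for this example, and whitespace tokenization in particular would handle unsegmented Chinese text poorly.

```python
import re
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-word characters to one space."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of word n-grams; short texts yield one gram with all tokens."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_contaminated(
    train_prompts: Iterable[str], test_prompts: Iterable[str], n: int = 8
) -> List[str]:
    """Keep only training prompts that share no n-gram with any test prompt."""
    test_grams: Set[str] = set()
    for prompt in test_prompts:
        test_grams |= ngrams(prompt, n)
    return [p for p in train_prompts if not (ngrams(p, n) & test_grams)]
```

A smaller n catches more paraphrased overlap at the cost of more false positives, so the window size is a tuning choice rather than a fixed rule.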

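RewardBench leaderboard for generative models

The models were evaluated on RewardBench using the official test script. As of September 2024, Skywork-Critic-Llama3.1-70B ranked first among generative models of all sizes, and Skywork-Critic-Llama3.1-8B ranked first among generative models with fewer than 10B parameters. (Note: an asterisk (*) denotes an open-source model.)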

Model	Chat	Chat Hard	Safety	Reasoning	Overall Score
Skywork-Critic-Llama3.1-70B *	96.6	87.9	93.1	95.5	93.3
Salesforce/SFR-LLaMa-3.1-70B-Judge-r	96.9	84.8	91.6	97.6	92.7
Salesforce/SFR-nemo-12B-Judge-r	97.2	82.2	86.5	95.1	90.3
Skywork-Critic-Llama3.1-8B *	93.6	81.4	91.1	89.8	89.0
Salesforce/SFR-LLaMa-3.1-8B-Judge-r	95.5	77.7	86.2	95.1	88.7
facebook/Self-taught-Llama-3-70B	96.9	84.0	91.1	82.5	88.6
google/gemini-1.5-pro-0514	92.3	80.6	87.9	92.0	88.2
openai/gpt-4o-2024-08-06	96.1	76.1	88.1	86.6	86.7
openai/gpt-4-0125-preview	95.3	74.3	87.6	86.9	86.0
openai/gpt-4-turbo-2024-04-09	95.3	75.4	87.6	82.7	85.2
Anthropic/claude-3-5-sonnet-20240620	96.4	74.0	81.6	84.7	84.2
meta-llama/Meta-Llama-3.1-70B-Instruct *	97.2	70.2	82.8	86.0	84.0
NCSOFT/Llama-3-OffsetBias-8B *	92.5	80.3	86.8	76.4	84.0
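The source does not state how the overall score is computed. For the rows checked here, the reported value is consistent with an unweighted mean of the four category scores, up to rounding. The sketch below verifies that observation for a few rows copied from the table; the helper names are hypothetical and the check is an observation, not a statement of the official RewardBench formula.

```python
# (model, chat, chat_hard, safety, reasoning, overall), copied from the table above.
ROWS = [
    ("Skywork-Critic-Llama3.1-70B", 96.6, 87.9, 93.1, 95.5, 93.3),
    ("Skywork-Critic-Llama3.1-8B", 93.6, 81.4, 91.1, 89.8, 89.0),
    ("openai/gpt-4o-2024-08-06", 96.1, 76.1, 88.1, 86.6, 86.7),
    ("Anthropic/claude-3-5-sonnet-20240620", 96.4, 74.0, 81.6, 84.7, 84.2),
]


def category_mean(row):
    """Unweighted mean of the four category scores in a row."""
    return sum(row[1:5]) / 4


def max_gap(rows):
    """Largest absolute difference between the category mean and the reported overall score."""
    return max(abs(category_mean(row) - row[5]) for row in rows)


for row in ROWS:
    print(f"{row[0]}: mean={category_mean(row):.3f} reported={row[5]}")
```

Small gaps are expected because the table shows category scores already rounded to one decimal place.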

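🔧 Technical Details

The models are fine-tuned from Meta's Llama-3.1 series using instruction tuning, with a focus on pairwise preference evaluation and general chat tasks. Training used a range of high-quality datasets: cleaned open-source data, in-house human annotation data, synthetic critic data, and critic-related chat data. A thorough verification process ensured that the training dataset contains no information from any RewardBench test set, preserving the integrity of the evaluation results.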

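📄 License

Declaration

We hereby declare that Skywork models must not be used for any activity that threatens national or societal security or for any unlawful purpose. We also ask users not to deploy Skywork models in internet services without appropriate security review and registration. We hope all users will follow this principle so that technological progress takes place in a regulated and lawful environment.

Although every effort has been made to ensure the compliance of the data used in training, unforeseeable risks and issues may remain because of the complexity of the model and the data. Therefore, if any problem arises from using the Skywork open-source models, including but not limited to data security issues, public opinion risks, or any risks and issues arising from the model being misled, abused, disseminated, or improperly used, we will not assume any responsibility.

License agreement

Community use of Skywork models is governed by the Skywork Community License. Skywork models support commercial use. If you plan to use a Skywork model or its derivatives for commercial purposes, you must comply with the terms and conditions of the Skywork Community License.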

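📞 Contact Us

If you have any questions or feedback, please contact us at shiwen.tu@kunlun-inc.com or liang.zhao@kunlun-inc.com. The project is led by Liang Zhao.

📚 Citation

If you find our work helpful, please cite us using the following BibTeX entry: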

@misc{skyworkcritic2024,
  title={Skywork Critic Model Series},
  author={Shiwen, Tu and Liang, Zhao and Liu, Chris Yuhao and Zeng, Liang and Liu, Yang},
  year={2024},
  month={September},
  howpublished={\url{https://huggingface.co/Skywork}},
  url={https://huggingface.co/Skywork},
}