🚀 RootSignals-Judge-Llama-70B Model Card
Root Judge is a powerful mid-sized large language model (LLM) for reliable, customizable evaluation of LLM systems. It was post-trained from Llama-3.3-70B-Instruct on high-quality, human-annotated datasets, and is suited to pairwise preference judgments and multi-turn instruction following with source citations. The model's FP8 weights are available free of charge, enabling low-cost research and commercial use.
Root Judge outperforms Llama-3.3-Instruct and similarly sized open models at instruction following, and achieves state-of-the-art hallucination detection compared with leading closed-source models, at a fraction of their cost.
🚀 Quick Start
3.1 Via the Root Signals Python SDK
The model is freely available on our platform as part of the evaluation suite.
Install our Python library:
pip install root-signals
Import it:
from root import RootSignals
client = RootSignals()
Create a custom evaluator powered by Root Judge:
my_custom_judge = client.evaluators.create(
name="Political Text Evaluator",
intent="To measure the politics-relatedness of a given text",
predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
model="RootJudge",
)
Run it:
result = my_custom_judge.run(
response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score) # normalized score in [0, 1]
print(result.justification) # detailed reasoning behind the score
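Since `result.score` is normalized to [0, 1], it is often convenient to map it onto a binary decision. A minimal sketch, assuming a hypothetical `to_verdict` helper and a freely chosen 0.5 threshold (neither is part of the SDK):

```python
# Hypothetical post-processing of evaluator results; `to_verdict` and the
# 0.5 threshold are illustrative choices, not part of the root-signals SDK.
def to_verdict(score: float, threshold: float = 0.5) -> str:
    """Map a normalized [0, 1] score onto a PASS/FAIL verdict."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return "PASS" if score >= threshold else "FAIL"

# Example: verdict = to_verdict(result.score)
```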
3.2 Local Deployment
We recommend SGLang for production use, with XML tags around the important parts of the prompt. The model runs on 80GB of VRAM, but we recommend at least 96GB for evaluating long-context retrieval-augmented generation (RAG) inputs.
SGLang example on a single Nvidia H100 (80GB):
docker run \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v huggingface:/root/.cache/huggingface \
--volume /etc/localtime:/etc/localtime:ro \
-d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
python3 -m sglang.launch_server \
--model-path root-signals/RootSignals-Judge-Llama-70B \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.89 \
--grammar-backend xgrammar \
--enable-torch-compile \
--disable-cuda-graph
We have also validated the model with vLLM on an Nvidia GH200 (arm64), with outputs of up to 64k tokens:
docker run \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v huggingface:/root/.cache/huggingface \
--volume /etc/localtime:/etc/localtime:ro \
-d drikster80/vllm-gh200-openai:v0.6.4.post1 \
--model root-signals/RootSignals-Judge-Llama-70B \
--gpu-memory-utilization 0.95 \
--max-model-len 65536 \
--block_size 16 \
--enable_prefix_caching
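Both servers above expose an OpenAI-compatible /v1 endpoint on port 8000. A minimal smoke test might look like the sketch below; `build_request` is an illustrative helper, and `smoke_test` assumes the `openai` package is installed and a server is actually running locally:

```python
# Hypothetical smoke test for either local server (SGLang or vLLM) above.
MODEL_ID = "root-signals/RootSignals-Judge-Llama-70B"

def build_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build a chat-completions payload for the local endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }

def smoke_test(base_url: str = "http://localhost:8000/v1") -> str:
    """Send one request; only call this with a server running."""
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(**build_request("Say: ready"))
    return resp.choices[0].message.content
```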
To detect hallucinations against a provided context, here is an example using HaluBench:
decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>
<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>
<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""
decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()
import os
import json
import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel
testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]
class DecomposeResponse(BaseModel):
REASONING: str
VERDICT: str
client = OpenAI(base_url="http://localhost:8000/v1") # export a different one for e.g. sglang, openrouter, etc.
response = client.beta.chat.completions.parse(
model="root-signals/RootSignals-Judge-Llama-70B", # or `RootJudge` if you are using the RootSignals API
messages=[
{"role": "system", "content": decompose_system_instruction},
{"role": "user", "content": decompose_prompt.format(
question=example_row["question"],
document=example_row["passage"],
answer=example_row["answer"])},
],
response_format=DecomposeResponse,
).choices[0].message.parsed
pprint(response.REASONING)
pprint(response.VERDICT)
> ('Following the instructions: #1, the key element in the question is the '
"nationality of the magazines. #2, the document states that 'The Woman's "
"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
"is a British weekly women's magazine'. #3, the answer claims both magazines "
'are British. #4, checking each claim in the answer: a) The document does not '
"support the claim that The Woman's Viewpoint is British, instead, it says "
"the magazine was founded in Texas. b) There's no reasonable inference from "
"the document that would suggest The Woman's Viewpoint is British. c) The "
"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
'answer introduces information (both being British) not supported by the '
'document. #6, additional information about both magazines being British is '
'introduced in the answer without being present in the document or question. '
'#7, the answer makes an unjustified assumption by stating both magazines are '
"British despite the document clearly stating The Woman's Viewpoint was "
'founded in Texas, implying it is not British. Therefore, the answer fails to '
'accurately reflect the information provided in the document and makes '
'unjustified assumptions based on the information given in the question and '
"document.')
'FAIL'
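To turn this into a Pass@1 figure over many HaluBench rows, the parsed verdicts can be compared against the dataset's reference labels. A minimal sketch, assuming both predictions and labels are "PASS"/"FAIL" strings (check the dataset schema before relying on this):

```python
# Hypothetical aggregation of judge verdicts; assumes predictions and
# reference labels are both "PASS"/"FAIL" strings.
def pass_at_1(predictions: list[str], labels: list[str]) -> float:
    """Fraction of rows where the judge's verdict matches the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip().upper() == l.strip().upper()
               for p, l in zip(predictions, labels))
    return hits / len(predictions)
```

In a loop over `testset_df`, collect `response.VERDICT` for each row together with the row's reference label, then call `pass_at_1` on the two lists.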
✨ Key Features
1. Intended Use Cases
Root Judge is primarily intended for use as an LLM-as-a-Judge in a variety of scenarios:
- Detecting context-grounded hallucinations in an explainable way, for example in retrieval-augmented generation (RAG) setups, with justifications for its scores.
- Pairwise preference judgments backed by strong adherence to evaluation instructions.
- Serving as a custom evaluation metric driven by use-case-specific criteria.
- Assisting inference-time search or synthetic data tasks that require best-of-N decisions.
- Privacy-sensitive settings that require local deployment.
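For the pairwise preference use case, the same XML-tagged prompt style recommended earlier applies. A minimal sketch of a hypothetical prompt builder (the tag names and instruction wording are illustrative, not a prescribed format):

```python
# Hypothetical helper for framing an A/B comparison; tag names are
# illustrative, not a format mandated by the model.
def build_pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Frame an A/B preference judgment using XML-tagged sections."""
    return (
        "<TASK>\n"
        "Compare the two candidate responses to the instruction and pick the "
        "better one. Answer with a JSON object: "
        '{"REASONING": "...", "CHOICE": "A" or "B"}\n'
        "</TASK>\n"
        f"<INSTRUCTION>{instruction}</INSTRUCTION>\n"
        f"<RESPONSE_A>{response_a}</RESPONSE_A>\n"
        f"<RESPONSE_B>{response_b}</RESPONSE_B>"
    )
```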
2. Performance Summary
Root Judge outperforms leading closed-source models at detecting instruction-following failures during evaluation, while providing detailed, structured justifications for long inputs of up to 32k tokens, both on internal benchmarks and on the public HaluBench test.
2.1 Hallucination Detection (RAG settings)
📊 Benchmark: HaluBench test set:
Rank | Model | Test Samples | Pass@1 (%) | Cost ($) |
---|---|---|---|---|
1 | Root Judge | 14900 | 86.3 | 3.98 |
2 | GPT-4o | 14900 | 86.1 | 33.12 |
3 | o1-preview | 14899 | 85.3 | 1062* |
4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
6 | o1-mini | 14655 | 83.7 | 156 |
7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |
* Benchmarked with o1-preview; at current o1 pricing, excluding reasoning tokens, the cost would start from $198.74. Local costs are based on January 2025 Lambda Labs instance pricing.
2.2 Instruction Following
📊 Instruction-following performance versus other open-weight evaluator and reward models across a range of benchmarks (higher is better):
Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Average | vs. Root Judge (%) |
---|---|---|---|---|---|---|---|---|---|
1 | Root Judge | 70 | 94.6 ± 0.6 | 93.9 | 52.8 ± 3.2 | 24.6 ± 2.7 | 56.8 ± 3.1 | 64.5 | 100 |
2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | 60.8 ± 3.1 | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | 35.9 ± 3.0 | 43.2 ± 3.1 | 54.7 | 84.8 |
2.3 Root Signals Internal Benchmarks
📊 Benchmark: Root Signals internal hallucination detection benchmark
Figure 1: Overall Pass@1 and agreement (disagreement) as judged by an ensemble of leading third-party models.
Figure 2: Custom rubric instruction following by high-level task.
Root Judge has been tested to support complex, user-defined scoring (rating) criteria at large context sizes. It provides nuanced qualitative feedback and supports structured evaluation outputs as well as tool calls.
2.4 Other Benchmarks
📊 RewardBench
Benchmark Task | Score | Total | Accuracy |
---|---|---|---|
alpacaeval-easy | 99.0 | 100 | 0.99 |
alpacaeval-hard | 93.0 | 95 | 0.97894737 |
alpacaeval-length | 86.0 | 95 | 0.90526316 |
donotanswer | 73.5 | 136 | 0.54044118 |
hep-cpp | 159.0 | 164 | 0.96951220 |
hep-go | 159.0 | 164 | 0.96951220 |
hep-java | 161.0 | 164 | 0.98170732 |
hep-js | 159.0 | 164 | 0.96951220 |
hep-python | 158.0 | 164 | 0.96341463 |
hep-rust | 152.0 | 164 | 0.92682927 |
llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
llmbar-natural | 94.0 | 100 | 0.94 |
math-prm | 357.0 | 447 | 0.79865772 |
mt-bench-easy | 28.0 | 28 | 1.0 |
mt-bench-hard | 32.0 | 37 | 0.86486486 |
mt-bench-med | 40.0 | 40 | 1.0 |
refusals-dangerous | 73.5 | 100 | 0.735 |
refusals-offensive | 89.0 | 100 | 0.89 |
xstest-should-refuse | 140.5 | 154 | 0.91233766 |
xstest-should-respond | 245.0 | 250 | 0.98 |
Chat | | | 0.96648045 |
Chat Hard | | | 0.74561404 |
Safety | | | 0.83986486 |
Reasoning | | | 0.88103618 |
While our primary focus is nuanced, transparent judgment of candidate responses, we also tested the model checkpoints extensively on public and private benchmarks to guard against known regressions such as catastrophic forgetting. We found that the model retains the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization, while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.
📚 Detailed Documentation
4. Model Details
4.1 Overview
Property | Details |
---|---|
Developed by | Root Signals Inc |
Model type | Text-only decoder Transformer |
Language(s) (NLP) | Primarily English |
Finetuned from | meta-llama/Llama-3.3-70B-Instruct |
4.2 Training Details
- Training regime: Direct Preference Optimization (DPO) with the IPO loss, trained for 3 epochs in bfloat16 mixed precision on 384 GPUs.
- Hardware Type: LUMI-G / AMD Radeon Instinct™ MI250X
- Cloud Provider: LUMI supercomputer
- Compute Region: Finland
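For reference, the IPO objective (Azar et al., 2023) regresses the preference log-ratio gap toward a constant rather than passing it through a sigmoid as in standard DPO. A sketch of the loss, where $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $(y_w, y_l)$ the preferred and rejected responses, and $\tau$ the regularization strength:

```latex
h_\theta(x, y_w, y_l) = \log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)},
\qquad
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\tau}\right)^{\!2}\right]
```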
📄 License
The model is released under the Llama 3.3 license.
🔗 Contact
Links
Email
- hello@rootsignals.ai