🚀 RootSignals-Judge-Llama-70B Model Card
Root Judge is a powerful mid-sized large language model (LLM) for reliable, customizable evaluation of LLM systems. It was post-trained from Llama-3.3-70B-Instruct on high-quality, human-annotated datasets, and is suited to pairwise preference judgments and multi-turn instruction following with source citations. The model's FP8 weights are available free of charge, enabling low-cost research and commercial use.
Root Judge outperforms Llama-3.3-Instruct and similarly sized open models at instruction following, and achieves state-of-the-art hallucination detection compared with leading closed-source models, at a fraction of their cost.
🚀 Quick Start
3.1 Via the Root Signals Python SDK
The model is freely available on our platform as part of the evaluation suite.
Install our Python library:
pip install root-signals
Import it:
from root import RootSignals
client = RootSignals()
Create a custom evaluator powered by Root Judge:
my_custom_judge = client.evaluators.create(
name="Political Text Evaluator",
intent="To measure the politics-relatedness of a given text",
predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
model="RootJudge",
)
Run it:
result = my_custom_judge.run(
response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score) # normalized score in [0, 1]
print(result.justification) # detailed reasoning behind the score
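Since `result.score` is normalized to [0, 1], it is often convenient to map it onto a binary decision. A minimal sketch, assuming a hypothetical `to_verdict` helper and a freely chosen 0.5 threshold (neither is part of the SDK):

```python
# Hypothetical post-processing of evaluator results; `to_verdict` and the
# 0.5 threshold are illustrative choices, not part of the root-signals SDK.
def to_verdict(score: float, threshold: float = 0.5) -> str:
    """Map a normalized [0, 1] score onto a PASS/FAIL verdict."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return "PASS" if score >= threshold else "FAIL"

# Example: verdict = to_verdict(result.score)
```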
3.2 Local Deployment
We recommend SGLang for production use, with XML tags around the important parts of the prompt. The model runs on 80GB of VRAM, but we recommend at least 96GB for evaluating long-context retrieval-augmented generation (RAG) inputs.
SGLang example on a single Nvidia H100 (80GB):
docker run \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v huggingface:/root/.cache/huggingface \
--volume /etc/localtime:/etc/localtime:ro \
-d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
python3 -m sglang.launch_server \
--model-path root-signals/RootSignals-Judge-Llama-70B \
--host 0.0.0.0 \
--port 8000 \
--mem-fraction-static 0.89 \
--grammar-backend xgrammar \
--enable-torch-compile \
--disable-cuda-graph
We have also validated the model with vLLM on an Nvidia GH200 (arm64), with outputs of up to 64k tokens:
docker run \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v huggingface:/root/.cache/huggingface \
--volume /etc/localtime:/etc/localtime:ro \
-d drikster80/vllm-gh200-openai:v0.6.4.post1 \
--model root-signals/RootSignals-Judge-Llama-70B \
--gpu-memory-utilization 0.95 \
--max-model-len 65536 \
--block_size 16 \
--enable_prefix_caching
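Both servers above expose an OpenAI-compatible /v1 endpoint on port 8000. A minimal smoke test might look like the sketch below; `build_request` is an illustrative helper, and `smoke_test` assumes the `openai` package is installed and a server is actually running locally:

```python
# Hypothetical smoke test for either local server (SGLang or vLLM) above.
MODEL_ID = "root-signals/RootSignals-Judge-Llama-70B"

def build_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build a chat-completions payload for the local endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }

def smoke_test(base_url: str = "http://localhost:8000/v1") -> str:
    """Send one request; only call this with a server running."""
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(**build_request("Say: ready"))
    return resp.choices[0].message.content
```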
To detect hallucinations against a provided context, here is an example using HaluBench:
decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>
<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>
<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""
decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()
import os
import json
import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel
testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]
class DecomposeResponse(BaseModel):
REASONING: str
VERDICT: str
client = OpenAI(base_url="http://localhost:8000/v1") # export a different one for e.g. sglang, openrouter, etc.
response = client.beta.chat.completions.parse(
model="root-signals/RootSignals-Judge-Llama-70B", # or `RootJudge` if you are using the RootSignals API
messages=[
{"role": "system", "content": decompose_system_instruction},
{"role": "user", "content": decompose_prompt.format(
question=example_row["question"],
document=example_row["passage"],
answer=example_row["answer"])},
],
response_format=DecomposeResponse,
).choices[0].message.parsed
pprint(response.REASONING)
pprint(response.VERDICT)
> ('Following the instructions: #1, the key element in the question is the '
"nationality of the magazines. #2, the document states that 'The Woman's "
"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
"is a British weekly women's magazine'. #3, the answer claims both magazines "
'are British. #4, checking each claim in the answer: a) The document does not '
"support the claim that The Woman's Viewpoint is British, instead, it says "
"the magazine was founded in Texas. b) There's no reasonable inference from "
"the document that would suggest The Woman's Viewpoint is British. c) The "
"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
'answer introduces information (both being British) not supported by the '
'document. #6, additional information about both magazines being British is '
'introduced in the answer without being present in the document or question. '
'#7, the answer makes an unjustified assumption by stating both magazines are '
"British despite the document clearly stating The Woman's Viewpoint was "
'founded in Texas, implying it is not British. Therefore, the answer fails to '
'accurately reflect the information provided in the document and makes '
'unjustified assumptions based on the information given in the question and '
"document.')
'FAIL'
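To turn this into a Pass@1 figure over many HaluBench rows, the parsed verdicts can be compared against the dataset's reference labels. A minimal sketch, assuming both predictions and labels are "PASS"/"FAIL" strings (check the dataset schema before relying on this):

```python
# Hypothetical aggregation of judge verdicts; assumes predictions and
# reference labels are both "PASS"/"FAIL" strings.
def pass_at_1(predictions: list[str], labels: list[str]) -> float:
    """Fraction of rows where the judge's verdict matches the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip().upper() == l.strip().upper()
               for p, l in zip(predictions, labels))
    return hits / len(predictions)
```

In a loop over `testset_df`, collect `response.VERDICT` for each row together with the row's reference label, then call `pass_at_1` on the two lists.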
✨ Key Features
1. Intended Use Cases
Root Judge is primarily intended for use as an LLM-as-a-Judge in a variety of scenarios:
- Detecting context-grounded hallucinations in an explainable way, for example in retrieval-augmented generation (RAG) setups, with justifications for its scores.
- Pairwise preference judgments backed by strong adherence to evaluation instructions.
- Serving as a custom evaluation metric driven by use-case-specific criteria.
- Assisting inference-time search or synthetic data tasks that require best-of-N decisions.
- Privacy-sensitive settings that require local deployment.
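For the pairwise preference use case, the same XML-tagged prompt style recommended earlier applies. A minimal sketch of a hypothetical prompt builder (the tag names and instruction wording are illustrative, not a prescribed format):

```python
# Hypothetical helper for framing an A/B comparison; tag names are
# illustrative, not a format mandated by the model.
def build_pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Frame an A/B preference judgment using XML-tagged sections."""
    return (
        "<TASK>\n"
        "Compare the two candidate responses to the instruction and pick the "
        "better one. Answer with a JSON object: "
        '{"REASONING": "...", "CHOICE": "A" or "B"}\n'
        "</TASK>\n"
        f"<INSTRUCTION>{instruction}</INSTRUCTION>\n"
        f"<RESPONSE_A>{response_a}</RESPONSE_A>\n"
        f"<RESPONSE_B>{response_b}</RESPONSE_B>"
    )
```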
2. Performance Summary
Root Judge outperforms leading closed-source models at detecting instruction-following failures during evaluation, while providing detailed, structured justifications for long inputs of up to 32k tokens, both on internal benchmarks and on the public HaluBench test.
2.1 Hallucination Detection (RAG settings)
📊 Benchmark: HaluBench test set:
Rank | Model | Test Samples | Pass@1 (%) | Cost ($) |
---|---|---|---|---|
1 | Root Judge | 14900 | 86.3 | 3.98 |
2 | GPT-4o | 14900 | 86.1 | 33.12 |
3 | o1-preview | 14899 | 85.3 | 1062* |
4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
6 | o1-mini | 14655 | 83.7 | 156 |
7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |
* Benchmarked with o1-preview; at current o1 pricing, excluding reasoning tokens, the cost would start from $198.74. Local costs are based on January 2025 Lambda Labs instance pricing.
2.2 Instruction Following
📊 Instruction-following performance versus other open-weight evaluator and reward models across a range of benchmarks (higher is better):
Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Average | vs. Root Judge (%) |
---|---|---|---|---|---|---|---|---|---|
1 | Root Judge | 70 | 94.6 ± 0.6 | 93.9 | 52.8 ± 3.2 | 24.6 ± 2.7 | 56.8 ± 3.1 | 64.5 | 100 |
2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | 60.8 ± 3.1 | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | 35.9 ± 3.0 | 43.2 ± 3.1 | 54.7 | 84.8 |
2.3 Root Signals Internal Benchmarks
📊 Benchmark: Root Signals internal hallucination detection benchmark
Figure 1: Overall Pass@1 and agreement (disagreement) as judged by an ensemble of leading third-party models.
Figure 2: Custom rubric instruction following by high-level task.
Root Judge has been tested to support complex, user-defined scoring (rating) criteria at large context sizes. It provides nuanced qualitative feedback and supports structured evaluation outputs as well as tool calls.
2.4 Other Benchmarks
📊 RewardBench
Benchmark Task | Score | Total | Accuracy |
---|---|---|---|
alpacaeval-easy | 99.0 | 100 | 0.99 |
alpacaeval-hard | 93.0 | 95 | 0.97894737 |
alpacaeval-length | 86.0 | 95 | 0.90526316 |
donotanswer | 73.5 | 136 | 0.54044118 |
hep-cpp | 159.0 | 164 | 0.96951220 |
hep-go | 159.0 | 164 | 0.96951220 |
hep-java | 161.0 | 164 | 0.98170732 |
hep-js | 159.0 | 164 | 0.96951220 |
hep-python | 158.0 | 164 | 0.96341463 |
hep-rust | 152.0 | 164 | 0.92682927 |
llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
llmbar-natural | 94.0 | 100 | 0.94 |
math-prm | 357.0 | 447 | 0.79865772 |
mt-bench-easy | 28.0 | 28 | 1.0 |
mt-bench-hard | 32.0 | 37 | 0.86486486 |
mt-bench-med | 40.0 | 40 | 1.0 |
refusals-dangerous | 73.5 | 100 | 0.735 |
refusals-offensive | 89.0 | 100 | 0.89 |
xstest-should-refuse | 140.5 | 154 | 0.91233766 |
xstest-should-respond | 245.0 | 250 | 0.98 |
Chat | | | 0.96648045 |
Chat Hard | | | 0.74561404 |
Safety | | | 0.83986486 |
Reasoning | | | 0.88103618 |
While our primary focus is nuanced, transparent judgment of candidate responses, we also tested the model checkpoints extensively on public and private benchmarks to guard against known regressions such as catastrophic forgetting. We found that the model retains the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization, while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.
📚 Detailed Documentation
4. Model Details
4.1 Overview
Property | Details |
---|---|
Developed by | Root Signals Inc |
Model type | Text-only decoder Transformer |
Language(s) (NLP) | Primarily English |
Finetuned from | meta-llama/Llama-3.3-70B-Instruct |
4.2 Training Details
- Training regime: Direct Preference Optimization (DPO) with the IPO loss, trained for 3 epochs in bfloat16 mixed precision on 384 GPUs.
- Hardware Type: LUMI-G / AMD Radeon Instinct™ MI250X
- Cloud Provider: LUMI supercomputer
- Compute Region: Finland
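For reference, the IPO objective (Azar et al., 2023) regresses the preference log-ratio gap toward a constant rather than passing it through a sigmoid as in standard DPO. A sketch of the loss, where $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $(y_w, y_l)$ the preferred and rejected responses, and $\tau$ the regularization strength:

```latex
h_\theta(x, y_w, y_l) = \log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)},
\qquad
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\tau}\right)^{\!2}\right]
```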
📄 License
The model is released under the Llama 3.3 license.
🔗 Contact
Links
Email
- hello@rootsignals.ai