RootSignals-Judge-Llama-70B: an open-source large language model, free to deploy, for reliable evaluation of LLM systems

Rootsignals Judge Llama 70B

Developed by root-signals

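Root Judge is a powerful mid-sized large language model designed for reliable, customizable evaluation of LLM systems. It is fine-tuned from Llama-3.3-70B-Instruct and excels at pairwise preference judgments and at multi-turn instruction following with source citation.

Large language model

Safetensors

English #Hallucination detection #Instruction-following evaluation #RAG quality evaluation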

Downloads 620

Release Time : 2/5/2025

Model Overview

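Root Judge is a mid-sized model specialized for evaluating large language models. It performs strongly at hallucination detection and instruction following, and supports local deployment and low-cost applications.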

Model Features

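High-performance hallucination detection

Detects context-grounded hallucinations in RAG settings, outperforming leading closed-source models.

Strong instruction following

Performs strongly across a range of benchmarks and supports complex user-defined evaluation criteria.

Low-cost, efficient deployment

FP8 weights are freely available, suitable for research and commercial applications, at a fraction of the cost of comparable models.

Long-context support

Handles long inputs of up to 32k tokens and provides detailed, structured justifications.

Local deployment support

Runs in local environments, making it suitable for privacy-sensitive scenarios.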

Model Capabilities

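Large language model evaluation

Hallucination detection

Instruction-following evaluation

Preference judgments

Structured output generation

Long-context processing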

Use Cases

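Model evaluation

Hallucination detection in RAG systems

Detects context-grounded hallucinations in retrieval-augmented generation systems.

Achieved a pass@1 rate of 86.3% on the HaluBench test set.

Instruction-following evaluation

Evaluates how well a model follows complex instructions.

Performs strongly on benchmarks such as IFEval.

Content moderation

Political content identification

Identifies politics-related content and terminology in text.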

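🚀 Model card for RootSignals-Judge-Llama-70B

Root Judge is a powerful mid-sized LLM that enables reliable and customizable evaluation of large language model (LLM) systems. Root Judge was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix focused on pairwise preference judgments and multi-turn instruction following with source citation. The model weights are freely available in FP8 format, which helps reduce the cost of research and commercial use.

Root Judge surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and achieves state-of-the-art (SOTA) hallucination detection compared with leading closed models, at lower cost.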

🚀 Quick start

3.1 Via the Root Signals Python SDK

The model is available for free on our platform as part of the evaluation toolkit.

Install the Python library.

pip install root-signals

Import it and create a client.

from root import RootSignals
client = RootSignals()

Create a custom evaluator powered by Root Judge.

my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)

Run it.

result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score between 0 and 1
print(result.justification)  # detailed reasoning for the score

3.2 Locally

For production, we recommend SGLang, together with XML tags around the important parts of your prompt. The model runs on 80 GB of VRAM, but 96 GB or more is recommended when evaluating long-context RAG inputs.
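The gap between "runs on 80 GB" and "96 GB recommended" can be illustrated with a rough memory estimate. The sketch below is not from the model card: it assumes the publicly documented Llama 3 70B architecture (80 layers, 8 key/value heads, head dimension 128), roughly 70 billion parameters stored at one byte each in FP8, and an FP16 KV cache. Real servers add activation memory and framework overhead on top of this.

```python
# Back-of-the-envelope VRAM estimate. All architecture constants below are
# assumptions based on the public Llama 3 70B configuration, not values
# stated in this model card.
NUM_LAYERS = 80      # transformer blocks (assumed)
NUM_KV_HEADS = 8     # grouped-query attention KV heads (assumed)
HEAD_DIM = 128       # dimension per attention head (assumed)
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2 covers one key vector and one value vector per layer and KV head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return num_tokens * per_token


def weight_bytes(num_params: float = 70e9, bytes_per_param: int = 1) -> float:
    """Approximate size of the weights (FP8 means about 1 byte per parameter)."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    print(f"FP8 weights: ~{weight_bytes() / GIB:.1f} GiB")
    print(f"KV cache, one 32k-token sequence: ~{kv_cache_bytes(32 * 1024) / GIB:.1f} GiB")
```

Under these assumptions, the weights alone take most of an 80 GB card, and a single 32k-token sequence adds about 10 GiB of KV cache, which is why extra headroom helps for long RAG inputs.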

Example of SGLang on a single Nvidia H100 (80 GB):

docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
   python3 -m sglang.launch_server \
   --model-path root-signals/RootSignals-Judge-Llama-70B \
   --host 0.0.0.0 \
   --port 8000 \
   --mem-fraction-static 0.89 \
   --grammar-backend xgrammar \
   --enable-torch-compile \
   --disable-cuda-graph

We have also validated the model on arm64 with vLLM on an Nvidia GH200, with outputs of up to 64k tokens.

docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
   --model root-signals/RootSignals-Judge-Llama-70B \
   --gpu-memory-utilization 0.95 \
   --max-model-len 64k \
   --block_size 16 \
   --enable_prefix_caching

Example of detecting hallucinations from context (using HaluBench):

decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user. 
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

import os
import json
import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]

class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str

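# The OpenAI client requires an API key; local servers usually ignore its value.
# Point base_url at a different endpoint for e.g. sglang, openrouter, etc.
client = OpenAI(base_url="http://localhost:8000/v1", api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"))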

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"], 
            document=example_row["passage"], 
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)

> ('Following the instructions: #1, the key element in the question is the '
 "nationality of the magazines. #2, the document states that 'The Woman's "
 "Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
 "is a British weekly women's magazine'. #3, the answer claims both magazines "
 'are British. #4, checking each claim in the answer: a) The document does not '
 "support the claim that The Woman's Viewpoint is British, instead, it says "
 "the magazine was founded in Texas. b) There's no reasonable inference from "
 "the document that would suggest The Woman's Viewpoint is British. c) The "
 "claim about The Woman's Viewpoint is contradicted by the document. #5, the "
 'answer introduces information (both being British) not supported by the '
 'document. #6, additional information about both magazines being British is '
 'introduced in the answer without being present in the document or question. '
 '#7, the answer makes an unjustified assumption by stating both magazines are '
 "British despite the document clearly stating The Woman's Viewpoint was "
 'founded in Texas, implying it is not British. Therefore, the answer fails to '
 'accurately reflect the information provided in the document and makes '
 'unjustified assumptions based on the information given in the question and '
 "document.', ")
'FAIL'
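To turn individual verdicts like the one above into an aggregate number, the judge's verdicts can be compared against reference labels. The following is a minimal sketch, assuming that "pass@1" means the share of samples where a single judge verdict matches the dataset label, and that the labels are the strings "PASS" or "FAIL" (for HaluBench, presumably the `label` column; that column name is an assumption here). The helper names are hypothetical.

```python
from typing import Iterable


def normalize_verdict(raw: str) -> str:
    """Map a raw verdict string to PASS, FAIL, or INVALID."""
    verdict = raw.strip().upper()
    return verdict if verdict in {"PASS", "FAIL"} else "INVALID"


def pass_at_1(verdicts: Iterable[str], labels: Iterable[str]) -> float:
    """Fraction of samples whose single judge verdict equals the label.

    Malformed verdicts are counted as wrong rather than dropped, so the
    denominator is always the number of evaluated samples.
    """
    pairs = list(zip(verdicts, labels))
    if not pairs:
        raise ValueError("no samples to score")
    correct = 0
    for verdict, label in pairs:
        normalized = normalize_verdict(verdict)
        if normalized != "INVALID" and normalized == label.strip().upper():
            correct += 1
    return correct / len(pairs)


if __name__ == "__main__":
    judge_verdicts = ["PASS", "fail ", "maybe"]
    gold_labels = ["PASS", "FAIL", "PASS"]
    print(f"pass@1 = {pass_at_1(judge_verdicts, gold_labels):.3f}")
```

Counting unparseable verdicts as failures keeps the metric conservative; dropping them instead would change the denominator and make runs harder to compare.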

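✨ Key features

1. Intended use cases

Root Judge is primarily intended to be used as an LLM-as-a-judge in contexts such as the following.

Detecting context-grounded hallucinations, for example in Retrieval Augmented Generation (RAG) settings, with explainable justifications for each score.
Pairwise preference judgments, backed by strong evaluation instruction following.
Serving as a custom evaluation metric driven by use-case-specific evaluation criteria.
Assisting tasks that require Best-of-N decisions, such as inference-time search or synthetic data generation.
Privacy-focused settings that require local deployment.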
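The Best-of-N use case can be sketched as a sequence of pairwise comparisons. This is an illustrative outline only, not the authors' method: `PairwiseJudge` is a hypothetical callable standing in for a call to Root Judge, and the toy `longer_is_better` judge exists purely so the example runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: returns True when response `a` is preferred
# over response `b` for the given prompt. In practice this would wrap a call
# to the judge model; here it is just a function type.
PairwiseJudge = Callable[[str, str, str], bool]


def best_of_n(prompt: str, candidates: Sequence[str], prefers: PairwiseJudge) -> str:
    """Pick one candidate using N-1 pairwise comparisons (running champion)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        # The challenger replaces the champion only if the judge prefers it.
        if prefers(prompt, challenger, best):
            best = challenger
    return best


def longer_is_better(prompt: str, a: str, b: str) -> bool:
    """Toy stand-in judge used only to make this sketch runnable."""
    return len(a) > len(b)


if __name__ == "__main__":
    answers = ["Paris", "Paris is the capital of France.", "It is Paris."]
    print(best_of_n("What is the capital of France?", answers, longer_is_better))
```

A running-champion loop needs only N-1 judge calls. Because LLM judges can be sensitive to the order in which candidates are shown, one option is to query both orderings and accept a replacement only when they agree.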

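2. Performance summary

Root Judge outperforms leading closed models at detecting instruction-following failures during evaluation, both on internal benchmarks and on the public HaluBench dataset, while providing detailed, structured justifications on long inputs of up to 32k tokens.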

2.1 Hallucination detection (RAG setting)

📊 Benchmark: HaluBench Test Set

Rank	Model	Test samples	Pass@1 rate (%)	Cost ($)
1	Root Judge	14900	86.3	3.98
2	GPT-4o	14900	86.1	33.12
3	o1-preview	14899	85.3	1062*
4	Claude Sonnet-3.5	14797	85.2	42.94
5	Llama3.1-70b-Instruct	13969	84.7	27.43
6	o1-mini	14655	83.7	156
7	Llama3.1-405b-Instruct	14881	83.6	269.82

* = Benchmarked as o1-preview. At current o1 pricing, and excluding reasoning tokens, the cost starts at $198.74. Local costs are based on lambdalabs instance pricing in January 2025.
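Because the models were scored on slightly different numbers of samples, the cost column is easier to compare after normalizing per 1,000 samples. The short sketch below does that for a few rows copied from the table above; the helper name `cost_per_1k` is illustrative, and the o1-preview row is left out because of its pricing caveat.

```python
# (test samples, total cost in USD), copied from the HaluBench table above.
RESULTS = {
    "Root Judge": (14900, 3.98),
    "GPT-4o": (14900, 33.12),
    "Claude Sonnet-3.5": (14797, 42.94),
    "Llama3.1-405b-Instruct": (14881, 269.82),
}


def cost_per_1k(samples: int, cost_usd: float) -> float:
    """Cost in USD to evaluate 1,000 samples at the observed rate."""
    return 1000 * cost_usd / samples


if __name__ == "__main__":
    baseline = cost_per_1k(*RESULTS["Root Judge"])
    for name, (samples, cost) in RESULTS.items():
        per_1k = cost_per_1k(samples, cost)
        print(f"{name}: ${per_1k:.2f} per 1k samples, {per_1k / baseline:.1f}x Root Judge")
```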

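📄 Detailed performance breakdown - hallucination detection

2.2 Instruction following

We compared instruction-following performance on various benchmarks against other open-weight judge models and reward models (higher is better).

Rank	Model	VRAM (GB)	GSM8K (%)	IFEval (%)	MUSR-Murder (%)	MUSR-Object (%)	MUSR-Team (%)	Average score	Relative to Root Judge (%)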
1	Root Judge	70	94.6 ± 0.6	93.9	52.8 ± 3.2	24.6 ± 2.7	56.8 ± 3.1	64.5	100
2	Llama-3.3-70B	140	94.4 ± 0.6	93.4	54.0 ± 3.2	23.4 ± 2.7	56.0 ± 3.2	64.3	99.5
3	Patronus-70B	140	91.7 ± 0.8	83.7	54.4 ± 3.2	24.6 ± 2.7	48.8 ± 3.2	60.6	93.9
4	Nemotron-70B	70	80.1 ± 1.1	85.0	53.6 ± 3.2	23.8 ± 2.7	55.6 ± 3.1	59.6	92.4
5	Qwen-2.5-32B	64	87.4 ± 0.9	87.5	58.8 ± 3.1	23.1 ± 2.6	45.2 ± 3.2	60.4	93.6
6	Flow Judge	16	78.7 ± 1.1	64.6	60.8 ± 3.1	23.4 ± 2.7	35.6 ± 3.0	52.6	81.5
7	Glider	8	78.7 ± 1.1	56.5	59.2 ± 3.1	35.9 ± 3.0	43.2 ± 3.1	54.7	84.8

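📄 Detailed performance breakdown - instruction following

2.3 Root Signals internal benchmarks

📊 Benchmark: Root Signals internal hallucination detection benchmark

Figure 1: Total pass@1 rate and consistency (delta), as assessed by an ensemble of leading third-party models.

Figure 2: Instruction following on custom rubrics, by high-level task.

Root Judge has been tested to support complex user-defined scoring (evaluation) rubrics over large context sizes. It provides detailed qualitative feedback and also supports structured evaluation output and tool calling.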

2.4 Other benchmarks

📊 RewardBench

Benchmark task	Score	Total	Accuracy
alpacaeval-easy	99.0	100	0.99
alpacaeval-hard	93.0	95	0.97894737
alpacaeval-length	86.0	95	0.90526316
donotanswer	73.5	136	0.54044118
hep-cpp	159.0	164	0.96951220
hep-go	159.0	164	0.96951220
hep-java	161.0	164	0.98170732
hep-js	159.0	164	0.96951220
hep-python	158.0	164	0.96341463
hep-rust	152.0	164	0.92682927
llmbar-adver-GPTInst	69.0	92	0.75
llmbar-adver-GPTOut	39.0	47	0.82978723
llmbar-adver-manual	32.0	46	0.69565217
llmbar-adver-neighbor	74.0	134	0.55223881
llmbar-natural	94.0	100	0.94
math-prm	357.0	447	0.79865772
mt-bench-easy	28.0	28	1.0
mt-bench-hard	32.0	37	0.86486486
mt-bench-med	40.0	40	1.0
refusals-dangerous	73.5	100	0.735
refusals-offensive	89.0	100	0.89
xstest-should-refuse	140.5	154	0.91233766
xstest-should-respond	245.0	250	0.98
Chat			0.96648045
Chat Hard			0.74561404
Safety			0.83986486
Reasoning			0.88103618
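The four section rows at the bottom of the table can be reproduced from the per-subset rows. The sketch below assumes the subset-to-section grouping used by the public RewardBench leaderboard (this grouping is not stated in the table itself), pools scores and totals within Chat, Chat Hard, and Safety, and averages math-prm with the pooled hep-* subsets for Reasoning. With these assumptions the results agree with the table to the printed precision.

```python
# (score, total) per subset, copied from the RewardBench table above.
SUBSETS = {
    "alpacaeval-easy": (99.0, 100), "alpacaeval-hard": (93.0, 95),
    "alpacaeval-length": (86.0, 95), "donotanswer": (73.5, 136),
    "hep-cpp": (159.0, 164), "hep-go": (159.0, 164), "hep-java": (161.0, 164),
    "hep-js": (159.0, 164), "hep-python": (158.0, 164), "hep-rust": (152.0, 164),
    "llmbar-adver-GPTInst": (69.0, 92), "llmbar-adver-GPTOut": (39.0, 47),
    "llmbar-adver-manual": (32.0, 46), "llmbar-adver-neighbor": (74.0, 134),
    "llmbar-natural": (94.0, 100), "math-prm": (357.0, 447),
    "mt-bench-easy": (28.0, 28), "mt-bench-hard": (32.0, 37),
    "mt-bench-med": (40.0, 40), "refusals-dangerous": (73.5, 100),
    "refusals-offensive": (89.0, 100), "xstest-should-refuse": (140.5, 154),
    "xstest-should-respond": (245.0, 250),
}

# Assumed grouping, following the public RewardBench section definitions.
SECTIONS = {
    "Chat": ["alpacaeval-easy", "alpacaeval-length", "alpacaeval-hard",
             "mt-bench-easy", "mt-bench-med"],
    "Chat Hard": ["mt-bench-hard", "llmbar-natural", "llmbar-adver-neighbor",
                  "llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                  "llmbar-adver-manual"],
    "Safety": ["refusals-dangerous", "refusals-offensive",
               "xstest-should-refuse", "xstest-should-respond", "donotanswer"],
}
HEP = [name for name in SUBSETS if name.startswith("hep-")]


def pooled_accuracy(names):
    """Sum of scores divided by sum of totals over the given subsets."""
    score = sum(SUBSETS[n][0] for n in names)
    total = sum(SUBSETS[n][1] for n in names)
    return score / total


def section_scores():
    """Section-level scores under the grouping assumed above."""
    scores = {name: pooled_accuracy(members) for name, members in SECTIONS.items()}
    # Reasoning weights math-prm equally against the pooled code (hep-*) subsets.
    scores["Reasoning"] = (pooled_accuracy(["math-prm"]) + pooled_accuracy(HEP)) / 2
    return scores


if __name__ == "__main__":
    for section, value in section_scores().items():
        print(f"{section}: {value:.8f}")
```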

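Although our main focus is nuanced and transparent judgment of candidate responses, we test the judge model checkpoints extensively on public and private benchmarks to avoid known performance-degradation issues such as catastrophic forgetting. The model retains the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization, and slightly outperforms it on public instruction-following benchmarks such as IFEval and MuSR.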

📚 Documentation

4.1 Overview

Attribute	Details
Developed by	Root Signals Inc
Model type	Text-only decoder Transformer
Language(s) (NLP)	Primarily English
Fine-tuned from model	meta-llama/Llama-3.3-70B-Instruct