RootSignals-Judge-Llama-70B Open Source Large Language Model - Free Deployment for Reliable Evaluation of LLM Systems

Rootsignals Judge Llama 70B

Developed by root-signals

Root Judge is a powerful mid-sized large language model designed for reliable and customizable evaluation of LLM systems. Fine-tuned from Llama-3.3-70B-Instruct, it excels at pairwise preference judgments and multi-turn instruction following with source citing.

Large Language Model

Safetensors

English #Hallucination detection #Instruction following evaluation #RAG quality evaluation

Downloads 620

Release Time : 2/5/2025

Model Overview

Root Judge is a medium-sized model focused on large language model evaluation, performing excellently in hallucination detection and instruction following, and supporting local deployment and low-cost applications.

Model Features

High-performance hallucination detection

Detect context-related hallucinations in RAG settings, outperforming leading closed-source models

Powerful instruction following ability

Performs excellently in various benchmark tests and supports complex user-defined scoring criteria

Low-cost and efficient deployment

FP8 weights are provided for free, suitable for research and commercial applications, with costs only a fraction of similar models

Long context support

Can handle long inputs up to 32k tokens and provide detailed structured justifications

Local deployment support

Suitable for privacy-sensitive scenarios and supports running in a local environment

Model Capabilities

Large language model evaluation

Hallucination detection

Instruction following evaluation

Preference judgment

Structured output generation

Long context processing

Use Cases

Model evaluation

RAG system hallucination detection

Detect context-related hallucinations in retrieval-augmented generation systems

Achieved an 86.3% pass rate on the HaluBench test set

Instruction following evaluation

Evaluate the model's ability to follow complex instructions

Performed excellently in benchmark tests such as IFEval

Content moderation

Political content recognition

Identify politically relevant content and terms in text

🚀 Model Card for RootSignals-Judge-Llama-70B

Root Judge is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations. It was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix for pairwise preference choice judgments and multi-turn instruction following with source citing. The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

Root Judge outperforms Llama-3.3-Instruct and similar-sized open models on instruction following, and achieves state-of-the-art hallucination detection compared to leading closed models, at a fraction of the cost.

🚀 Quick Start

Via Root Signals Python SDK

The model is available on our platform as part of our evaluation suite, for no additional cost.

Install our Python library:

pip install root-signals

Import:

from root import RootSignals
client = RootSignals()

Create a custom evaluator powered by Root Judge:

my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)

Execute:

result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score in the range [0, 1]
print(result.justification)  # detailed reasoning for the score

Locally

We recommend SGLang for production use cases, together with XML tags around important sections of your prompt. While the model can run on 80GB of VRAM, we recommend at least 96GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80GB):
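As a rough illustration of the VRAM guidance above, the following sketch estimates weight memory from the parameter count alone. It assumes a nominal 70 billion parameters and one byte per FP8 weight (two for bfloat16), and it ignores KV cache, activations and runtime overhead, so treat the numbers as a lower bound rather than a measured footprint.

```python
# Back-of-the-envelope weight memory estimate.
# Assumption: nominal "70B" parameter count; real checkpoints differ slightly.
PARAMS = 70e9

# Bytes needed to store one weight in each format.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_memory_gb(params: float, dtype: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


fp8_gb = weight_memory_gb(PARAMS, "fp8")
bf16_gb = weight_memory_gb(PARAMS, "bf16")

# Memory left for KV cache and activations on an 80 GB and a 96 GB device.
headroom_80 = 80 - fp8_gb
headroom_96 = 96 - fp8_gb

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Headroom on 80 GB: ~{headroom_80:.0f} GB, on 96 GB: ~{headroom_96:.0f} GB")
```

The small headroom on an 80GB card is consistent with the advice to use more memory when long RAG contexts need a large KV cache.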

docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
   python3 -m sglang.launch_server \
   --model-path root-signals/RootSignals-Judge-Llama-70B \
   --host 0.0.0.0 \
   --port 8000 \
   --mem-fraction-static 0.89 \
   --grammar-backend xgrammar \
   --enable-torch-compile \
   --disable-cuda-graph

We also validated the model on arm64 with vLLM on an Nvidia GH200, with max outputs of up to 64k tokens:

docker run \
   --gpus all \
   --ipc=host  \
   -p 8000:8000 \
   -v huggingface:/root/.cache/huggingface \
   --volume /etc/localtime:/etc/localtime:ro \
   -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
   --model root-signals/RootSignals-Judge-Llama-70B \
   --gpu-memory-utilization 0.95 \
   --max-model-len 64k \
   --block_size 16 \
   --enable_prefix_caching

Detect hallucinations from context (this example uses HaluBench):

decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

import os
import json
import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]

class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str

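# Point base_url at your server (SGLang, vLLM, OpenRouter, etc.). Local servers typically ignore the key,
# but the OpenAI client refuses to start without one, so fall back to a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key=os.getenv("OPENAI_API_KEY", "EMPTY"))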

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"], 
            document=example_row["passage"], 
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)

> ('Following the instructions: #1, the key element in the question is the '
 "nationality of the magazines. #2, the document states that 'The Woman's "
 "Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
 "is a British weekly women's magazine'. #3, the answer claims both magazines "
 'are British. #4, checking each claim in the answer: a) The document does not '
 "support the claim that The Woman's Viewpoint is British, instead, it says "
 "the magazine was founded in Texas. b) There's no reasonable inference from "
 "the document that would suggest The Woman's Viewpoint is British. c) The "
 "claim about The Woman's Viewpoint is contradicted by the document. #5, the "
 'answer introduces information (both being British) not supported by the '
 'document. #6, additional information about both magazines being British is '
 'introduced in the answer without being present in the document or question. '
 '#7, the answer makes an unjustified assumption by stating both magazines are '
 "British despite the document clearly stating The Woman's Viewpoint was "
 'founded in Texas, implying it is not British. Therefore, the answer fails to '
 'accurately reflect the information provided in the document and makes '
 'unjustified assumptions based on the information given in the question and '
 "document.', ")
'FAIL'
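To turn single verdicts like the one above into a benchmark figure, one option is to compare each verdict with a gold label and report the fraction that match. The sketch below assumes gold labels are the strings PASS or FAIL (as in the `label` column of HaluBench, which is an assumption here) and that anything else returned by the judge counts as a miss. The helper names `normalize_verdict` and `pass_at_1` are hypothetical and not part of any library, and this is not necessarily how the published numbers were computed.

```python
from typing import Iterable


def normalize_verdict(raw: str) -> str:
    """Map a raw verdict string to 'PASS' or 'FAIL'; anything else becomes 'INVALID'."""
    cleaned = raw.strip().upper()
    return cleaned if cleaned in {"PASS", "FAIL"} else "INVALID"


def pass_at_1(verdicts: Iterable[str], labels: Iterable[str]) -> float:
    """Fraction of samples where the judge's single verdict matches the gold label."""
    pairs = list(zip(verdicts, labels))
    if not pairs:
        raise ValueError("no samples to score")
    # An INVALID verdict never equals a PASS/FAIL label, so it counts as a miss.
    hits = sum(1 for v, gold in pairs if normalize_verdict(v) == gold.strip().upper())
    return hits / len(pairs)


# Illustrative data only, not taken from the benchmark.
judge_verdicts = ["FAIL", "pass ", "PASS", "maybe"]
gold_labels = ["FAIL", "PASS", "FAIL", "PASS"]

rate = pass_at_1(judge_verdicts, gold_labels)
print(f"pass@1: {rate:.1%}")
```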

✨ Features

Intended Use Cases

Root Judge is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:

Detecting context-grounded hallucinations, e.g. for Retrieval Augmented Generation (RAG) settings in an explainable manner, providing a justification for the score
Pairwise preference judgments due to strong evaluation instruction-following capabilities
Serving as a custom evaluation metric powered by use case specific evaluation rubrics
Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
Privacy-focused settings that require local deployments

Performance Summary

Root Judge outperforms leading closed models at detecting instruction-following failures in evaluations, and it provides detailed, structured justifications for long inputs of up to 32k tokens, both on internal benchmarks and on the public HaluBench test set.

Hallucination Detection (in RAG setting)

📊 Benchmark: HaluBench Test Set:

Rank	Model	Test Samples	Pass@1 Rate (%)	Cost ($)
1	Root Judge	14900	86.3	3.98
2	GPT-4o	14900	86.1	33.12
3	o1-preview	14899	85.3	1062*
4	Claude Sonnet-3.5	14797	85.2	42.94
5	Llama3.1-70b-Instruct	13969	84.7	27.43
6	o1-mini	14655	83.7	156
7	Llama3.1-405b-Instruct	14881	83.6	269.82

* = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead. Local costs are based on Lambda Labs instances at January 2025 prices.
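To make "a fraction of the cost" concrete, the short sketch below divides each model's benchmark cost by Root Judge's. The dollar figures are copied from the HaluBench table; the ratios are plain arithmetic on those figures and do not adjust for the slightly different sample counts per model.

```python
# Total benchmark cost in USD, copied from the HaluBench table.
costs = {
    "Root Judge": 3.98,
    "GPT-4o": 33.12,
    "o1-preview": 1062.0,
    "Claude Sonnet-3.5": 42.94,
    "Llama3.1-70b-Instruct": 27.43,
    "o1-mini": 156.0,
    "Llama3.1-405b-Instruct": 269.82,
}

baseline = costs["Root Judge"]

# How many times more expensive each model was than Root Judge on this run.
ratios = {
    model: cost / baseline
    for model, cost in costs.items()
    if model != "Root Judge"
}

for model, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{model}: {ratio:.1f}x the cost of Root Judge")
```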

📄 Detailed Performance Breakdown - Hallucination Detection

Instruction Following

📊 Instruction-following performance in various diverse benchmarks compared to other open-weights judge and reward models (higher is better):

Rank	Model	VRAM (GB)	GSM8K (%)	IFEval (%)	MUSR-Murder (%)	MUSR-Object (%)	MUSR-Team (%)	Avg Score	Relative to Root Judge (%)
1	Root Judge	70	94.6 ± 0.6	93.9	52.8 ± 3.2	24.6 ± 2.7	56.8 ± 3.1	64.5	100
2	Llama-3.3-70B	140	94.4 ± 0.6	93.4	54.0 ± 3.2	23.4 ± 2.7	56.0 ± 3.2	64.3	99.5
3	Patronus-70B	140	91.7 ± 0.8	83.7	54.4 ± 3.2	24.6 ± 2.7	48.8 ± 3.2	60.6	93.9
4	Nemotron-70B	70	80.1 ± 1.1	85.0	53.6 ± 3.2	23.8 ± 2.7	55.6 ± 3.1	59.6	92.4
5	Qwen-2.5-32B	64	87.4 ± 0.9	87.5	58.8 ± 3.1	23.1 ± 2.6	45.2 ± 3.2	60.4	93.6
6	Flow Judge	16	78.7 ± 1.1	64.6	60.8 ± 3.1	23.4 ± 2.7	35.6 ± 3.0	52.6	81.5
7	Glider	8	78.7 ± 1.1	56.5	59.2 ± 3.1	35.9 ± 3.0	43.2 ± 3.1	54.7	84.8
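The "Avg Score" and "Relative to Root Judge" columns appear to be the unweighted mean of the five benchmark scores and that mean divided by Root Judge's mean. This is an inference from the numbers, not something the model card states. The sketch below recomputes both from the rounded table values; the results agree with the reported columns to within about 0.1, which suggests the published figures were derived from unrounded scores.

```python
from statistics import mean

# (GSM8K, IFEval, MuSR-Murder, MuSR-Object, MuSR-Team, reported Avg Score),
# copied from the table above.
rows = {
    "Root Judge": (94.6, 93.9, 52.8, 24.6, 56.8, 64.5),
    "Llama-3.3-70B": (94.4, 93.4, 54.0, 23.4, 56.0, 64.3),
    "Patronus-70B": (91.7, 83.7, 54.4, 24.6, 48.8, 60.6),
    "Nemotron-70B": (80.1, 85.0, 53.6, 23.8, 55.6, 59.6),
    "Qwen-2.5-32B": (87.4, 87.5, 58.8, 23.1, 45.2, 60.4),
    "Flow Judge": (78.7, 64.6, 60.8, 23.4, 35.6, 52.6),
    "Glider": (78.7, 56.5, 59.2, 35.9, 43.2, 54.7),
}

# Unweighted mean over the five benchmark columns.
averages = {name: mean(values[:5]) for name, values in rows.items()}

# Each model's mean as a percentage of Root Judge's mean.
relative = {
    name: 100 * avg / averages["Root Judge"] for name, avg in averages.items()
}

for name, values in rows.items():
    print(
        f"{name}: recomputed avg {averages[name]:.2f} "
        f"(reported {values[5]}), relative {relative[name]:.1f}%"
    )
```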

[📄 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png) Image 1: Total pass@1 rates and consistency (delta) assessed via ensemble of leading 3rd party models.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png) Image 2: Custom rubric instruction-following by high level task.

Root Judge was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.

Other Benchmarks

📊 RewardBench

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

Benchmark Task	Score	Total	Accuracy
alpacaeval-easy	99.0	100	0.99
alpacaeval-hard	93.0	95	0.97894737
alpacaeval-length	86.0	95	0.90526316
donotanswer	73.5	136	0.54044118
hep-cpp	159.0	164	0.96951220
hep-go	159.0	164	0.96951220
hep-java	161.0	164	0.98170732
hep-js	159.0	164	0.96951220
hep-python	158.0	164	0.96341463
hep-rust	152.0	164	0.92682927
llmbar-adver-GPTInst	69.0	92	0.75
llmbar-adver-GPTOut	39.0	47	0.82978723
llmbar-adver-manual	32.0	46	0.69565217
llmbar-adver-neighbor	74.0	134	0.55223881
llmbar-natural	94.0	100	0.94
math-prm	357.0	447	0.79865772
mt-bench-easy	28.0	28	1.0
mt-bench-hard	32.0	37	0.86486486
mt-bench-med	40.0	40	1.0
refusals-dangerous	73.5	100	0.735
refusals-offensive	89.0	100	0.89
xstest-should-refuse	140.5	154	0.91233766
xstest-should-respond	245.0	250	0.98
Chat			0.96648045
Chat Hard			0.74561404
Safety			0.83986486
Reasoning			0.88103618
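The four section scores at the bottom of the table can be reproduced from the per-subset rows. The sketch below assumes the usual RewardBench grouping: Chat, Chat Hard and Safety pool the scores and totals of their subsets, while Reasoning is the plain average of math-prm and the pooled hep-* code subsets. The grouping is an assumption on my part rather than something stated in the model card, but it matches the reported values to the printed precision.

```python
# (score, total) per subset, copied from the RewardBench table.
subsets = {
    "alpacaeval-easy": (99.0, 100), "alpacaeval-hard": (93.0, 95),
    "alpacaeval-length": (86.0, 95), "donotanswer": (73.5, 136),
    "hep-cpp": (159.0, 164), "hep-go": (159.0, 164), "hep-java": (161.0, 164),
    "hep-js": (159.0, 164), "hep-python": (158.0, 164), "hep-rust": (152.0, 164),
    "llmbar-adver-GPTInst": (69.0, 92), "llmbar-adver-GPTOut": (39.0, 47),
    "llmbar-adver-manual": (32.0, 46), "llmbar-adver-neighbor": (74.0, 134),
    "llmbar-natural": (94.0, 100), "math-prm": (357.0, 447),
    "mt-bench-easy": (28.0, 28), "mt-bench-hard": (32.0, 37),
    "mt-bench-med": (40.0, 40), "refusals-dangerous": (73.5, 100),
    "refusals-offensive": (89.0, 100), "xstest-should-refuse": (140.5, 154),
    "xstest-should-respond": (245.0, 250),
}

# Assumed section membership (RewardBench convention).
sections = {
    "Chat": ["alpacaeval-easy", "alpacaeval-length", "alpacaeval-hard",
             "mt-bench-easy", "mt-bench-med"],
    "Chat Hard": ["mt-bench-hard", "llmbar-natural", "llmbar-adver-neighbor",
                  "llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                  "llmbar-adver-manual"],
    "Safety": ["refusals-dangerous", "refusals-offensive",
               "xstest-should-refuse", "xstest-should-respond", "donotanswer"],
}


def pooled_accuracy(names: list[str]) -> float:
    """Sum of scores divided by sum of totals over the given subsets."""
    score = sum(subsets[name][0] for name in names)
    total = sum(subsets[name][1] for name in names)
    return score / total


section_scores = {name: pooled_accuracy(members) for name, members in sections.items()}

# Reasoning: equal weight for math-prm and the pooled code (hep-*) subsets.
hep_names = [name for name in subsets if name.startswith("hep-")]
section_scores["Reasoning"] = (pooled_accuracy(["math-prm"]) + pooled_accuracy(hep_names)) / 2

for name, value in section_scores.items():
    print(f"{name}: {value:.8f}")
```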

Although our main focus is nuanced and transparent judgement of candidate responses, we test the judge model checkpoints extensively on public and private benchmarks to guard against known causes of performance drops such as catastrophic forgetting. We find that the model preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization, while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

📚 Documentation

Model Details

Overview

Property	Details
Developed by	Root Signals Inc
Model Type	Text-Only Decoder Transformer
Language(s) (NLP)	Primarily English
Finetuned from model	meta-llama/Llama-3.3-70B-Instruct

Training Details

Property	Details
Training regime	DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
Hardware Type	LUMI-G / AMD Radeon Instinct™ MI250X
Cloud Provider	[LUMI Supercomputer](https://lumi-supercomputer.eu)
Compute Region	Finland
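For readers unfamiliar with the "DPO with IPO loss" entry above, the following is a minimal per-pair sketch of the IPO objective as it is commonly implemented in preference-tuning libraries: a squared error that pulls the policy-versus-reference log-ratio margin towards 1/(2β). The function name, the β value and the log-probabilities are illustrative assumptions; the model card does not state the hyperparameters or the training code that was used.

```python
def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float,
) -> float:
    """IPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # IPO regresses the margin onto a fixed target instead of maximizing it.
    target = 1.0 / (2.0 * beta)
    return (margin - target) ** 2


# Illustrative numbers only: with beta = 0.1 the target margin is 5.
untrained = ipo_loss(-12.0, -12.0, -12.0, -12.0, beta=0.1)  # policy equals reference
on_target = ipo_loss(-10.0, -15.0, -12.0, -12.0, beta=0.1)  # margin is exactly 5

print(f"loss when policy equals reference: {untrained}")
print(f"loss when margin hits the target:  {on_target}")
```

Unlike the sigmoid-based DPO loss, this objective stops rewarding the policy once the margin reaches the target, which is intended to limit overfitting to the preference data.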

📄 License

The license for this model is llama3.3.

📞 Contact

Email

hello@rootsignals.ai
