Open-source phi3-hallucination-judge-merge model: effective hallucination detection for language model outputs

Phi3 Hallucination Judge Merge

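Developed by grounded-ai

This model detects hallucinations in language model outputs: responses that are coherent but factually incorrect or not grounded in the provided context.

Large language model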

Transformers

License: MIT #hallucination-detection #binary-classification #PEFT-fine-tuning

Downloads: 63

Release date: 4/25/2025

Model overview

A binary classification model fine-tuned specifically to detect hallucinations in language model outputs.

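Model features

High-performance hallucination detection

Reaches an F1 score of 0.81 on the hallucination detection task, ahead of Gemini Pro, GPT-3.5, and Palm 2 (Text Bison) and level with GPT-4 Turbo.

Lightweight adapter

Uses PEFT adapters for efficient fine-tuning without modifying the base model.

Standardized prompt strategy

Provides a standardized input format and prompt strategy for quick integration into existing systems.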

Model capabilities

Hallucination detection

Text classification

Language model output evaluation

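Use cases

Language model quality evaluation

Model output verification

Verify the factual accuracy of content generated by a language model

0.85 precision on class 0 of the classification report

Content moderation

Fact checking

Automatically detect factual errors in generated content

0.87 recall on the hallucination class (class 1 of the classification report)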

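🚀 Hallucination Detection PEFT Adapter Model

This repository contains our PEFT adapter model for hallucination evaluation. The model detects hallucinations in language model outputs, framing the problem as a binary classification task: did the model produce output that is factually incorrect or nonsensical?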

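🚀 Quick start

Hallucination detection metrics

Our merged model achieves the following performance on the binary classification task of detecting hallucinations in language model outputs: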

              precision    recall  f1-score   support

           0       0.85      0.71      0.77       100
           1       0.75      0.87      0.81       100

    accuracy                           0.79       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.79      0.79       200
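
The per-class figures in the report can be cross-checked by hand. The sketch below assumes a confusion matrix reconstructed from the rounded values (87 of the 100 class-1 examples and 71 of the 100 class-0 examples classified correctly). These counts are inferred for illustration and are not published by the authors.

```python
# Confusion-matrix counts inferred from the rounded report above.
# Assumption: these exact counts are a reconstruction, not published data.
tp, fn = 87, 13   # class 1 examples: predicted 1 / predicted 0
tn, fp = 71, 29   # class 0 examples: predicted 0 / predicted 1


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


class1 = prf(tp, fp, fn)   # class 1 treated as the positive class
class0 = prf(tn, fn, fp)   # roles swap when class 0 is the positive class
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("class 0:", [round(x, 2) for x in class0])
print("class 1:", [round(x, 2) for x in class1])
print("accuracy:", round(accuracy, 2))
```

Under these assumed counts every rounded value matches the report, so the precision, recall, F1, and accuracy rows are mutually consistent.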

Model usage

For best results, we recommend starting with the following prompt strategy (and encourage adjusting it as needed):

import torch
from transformers import AutoTokenizer, pipeline

# Assumption: Hub repo id inferred from the model name and developer listed above.
base_model = "grounded-ai/phi3-hallucination-judge-merge"
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Assumption: "eager" is the portable choice; "flash_attention_2" needs flash-attn installed.
attn_implementation = "eager"

def format_input(reference, query, response):
    """Build the judge prompt from the knowledge, the user input, and the model response."""
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
    A hallucination occurs when the response is coherent but factually incorrect or nonsensical
    outputs that are not grounded in the provided context.
    You are given the following information:
    ####INFO####
    [Knowledge]: {reference}
    [User Input]: {query}
    [Model Response]: {response}
    ####END INFO####
    Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
    """
    return prompt

# The context goes in `reference` so that it fills the [Knowledge] slot of the prompt.
text = format_input(reference='Walrus are the largest mammal',
          query='What is the best PC?',
          response='The best PC is the mac')

messages = [
    {"role": "user", "content": text}
]

pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
# Two new tokens are enough for "yes" or "no"; the low temperature keeps sampling near-deterministic.
generation_args = {
      "max_new_tokens": 2,
      "return_full_text": False,
      "temperature": 0.01,
      "do_sample": True,
  }

output = pipe(messages, **generation_args)
print(f'Hallucination: {output[0]["generated_text"].strip().lower()}')
# Hallucination: yes
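
Because the judge answers in free text, downstream code usually needs a boolean. The helper below is a minimal sketch of one way to normalize the generated text; `parse_verdict` is a hypothetical name and is not part of the model's repository or of Transformers.

```python
import re
from typing import Optional


def parse_verdict(generated_text: str) -> Optional[bool]:
    """Map the judge's raw text to True (hallucination), False, or None."""
    # Look only at the first alphabetic word, so "Yes.", " yes\n" and "NO" all parse.
    words = re.findall(r"[a-z]+", generated_text.lower())
    first = words[0] if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None  # unexpected output; the caller decides how to handle it


for raw in [" Yes.", "no\n", "maybe"]:
    print(repr(raw), "->", parse_verdict(raw))
```

Returning `None` for anything other than "yes" or "no" keeps malformed generations from being silently counted as either class.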

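Comparison with other models

We compared the merged model's performance on a hallucination detection benchmark against several state-of-the-art language models: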

Model	Precision	Recall	F1 score
Our merged model	0.75	0.87	0.81
GPT-4	0.93	0.72	0.82
GPT-4 Turbo	0.97	0.70	0.81
Gemini Pro	0.89	0.53	0.67
GPT-3.5	0.89	0.65	0.75
GPT-3.5-turbo-instruct	0.89	0.80	0.84
Palm 2 (Text Bison)	1.00	0.44	0.61
Claude V2	0.80	0.95	0.87

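As the table shows, our merged model reaches an F1 score of 0.81 on the hallucination detection task. That places it ahead of Gemini Pro, GPT-3.5, and Palm 2 (Text Bison), level with GPT-4 Turbo, and behind GPT-4, GPT-3.5-turbo-instruct, and Claude V2. Its recall of 0.87 is the second highest in the table, after Claude V2 at 0.95.

We will continue to improve and fine-tune the merged model to achieve better performance across a range of benchmarks and tasks.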
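
To see where the merged model sits, the comparison table can be sorted by F1. This is a small sketch with the values copied from the table; the variable names are illustrative.

```python
# (precision, recall, reported F1) copied from the comparison table.
scores = {
    "Our merged model": (0.75, 0.87, 0.81),
    "GPT-4": (0.93, 0.72, 0.82),
    "GPT-4 Turbo": (0.97, 0.70, 0.81),
    "Gemini Pro": (0.89, 0.53, 0.67),
    "GPT-3.5": (0.89, 0.65, 0.75),
    "GPT-3.5-turbo-instruct": (0.89, 0.80, 0.84),
    "Palm 2 (Text Bison)": (1.00, 0.44, 0.61),
    "Claude V2": (0.80, 0.95, 0.87),
}

# Sort by reported F1, highest first (ties keep their table order).
ranked = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)
for rank, (name, (precision, recall, f1)) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: F1={f1:.2f} recall={recall:.2f} precision={precision:.2f}")

# Models whose reported F1 is strictly higher than the merged model's.
ours_f1 = scores["Our merged model"][2]
higher_f1 = [name for name, (_, _, f1) in scores.items() if f1 > ours_f1]
print("Higher F1 than the merged model:", higher_f1)
```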

Citation

Scores are from arize/phoenix

📚 Detailed documentation

Training data

The training data for this model is cited from the following work:

@misc{HaluEval,
  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  year = {2023},
  journal = {arXiv preprint arXiv:2305.11747},
  url = {https://arxiv.org/abs/2305.11747}
}