M-Prometheus-7B开源评估模型 - 免费使用，支持多语言输出评估

首页

M Prometheus 7B

由 Unbabel 开发

M-Prometheus是一套开源的LLM评估模型，能够原生支持多语言输出的评估。基于48万条多语言直接评估和成对比较数据训练而成。

大型语言模型

Transformers

开源协议:其他 #多语言评估 #LLM质量评测 #翻译质量评分

下载量 238

发布时间 : 4/7/2025

模型简介

开源多语言LLM评估套件，支持多语言输出的评估，与Prometheus-2兼容。

模型特点

多语言评估

原生支持多语言输出的评估，基于48万条多语言数据训练

兼容性

使用方式与Prometheus-2完全兼容

长文本反馈

支持包含长文本反馈的评估

模型能力

多语言文本评估

机器翻译质量评估

生成详细评估反馈

使用案例

机器翻译评估

翻译质量评估

评估从源语言到目标语言的翻译质量

提供1-5分的评分及详细反馈

LLM输出评估

多语言生成评估

评估多语言LLM的生成质量

提供准确性、流畅度、风格等多维度评估

🚀 M-Prometheus

M-Prometheus是一套开源的大语言模型评估器，能够原生评估多语言输出。它们在48万个多语言直接评估和成对比较实例数据上进行训练，并带有详细反馈。可以像使用Prometheus-2一样对其进行提示。更多详细信息请查看我们的论文。

🚀 快速开始

M-Prometheus可用于原生评估多语言输出，为多语言评估提供了有效的解决方案。

✨ 主要特性

能够原生评估多语言输出。
在480k实例的多语言直接评估和成对比较数据上进行训练，并带有长格式反馈。
可以像Prometheus-2一样进行提示。

💻 使用示例

基础用法

"""###Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given. 
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general. 
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric. 
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)" 
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
Translate the following text from {source_language} to {target_language}: {source}

###Response to evaluate:
{hypothesis}

###Reference Answer (Score 5):
{reference}

###Score Rubrics: [Accuracy, Fluency, Style]
Score 1: The translation contains major errors that significantly alter the meaning of the source text. It is barely comprehensible and reads like a poor machine translation. The style is completely inconsistent with the source text.
Score 2: The translation has several inaccuracies that affect the overall meaning. It is difficult to read and understand, with frequent awkward phrasings. The style only occasionally matches the source text.
Score 3: The translation is mostly accurate but has some minor errors that don't significantly alter the meaning. It is generally understandable but lacks natural flow in some parts. The style is somewhat consistent with the source text.
Score 4: The translation is accurate with only a few negligible errors. It reads naturally for the most part, with occasional minor awkwardness. The style largely matches that of the source text.
Score 5: The translation is highly accurate, conveying the full meaning of the source text. It reads as fluently as an original text in the target language. The style perfectly captures the tone and register of the source text.

###Feedback:
"""

📄 许可证

许可证类型：其他

📚 详细文档

属性	详情
库名称	transformers
基础模型	Qwen/Qwen2.5-7B-Instruct

📚 引用

@misc{pombal2025mprometheussuiteopenmultilingual,
      title={M-Prometheus: A Suite of Open Multilingual LLM Judges}, 
      author={José Pombal and Dongkeun Yoon and Patrick Fernandes and Ian Wu and Seungone Kim and Ricardo Rei and Graham Neubig and André F. T. Martins},
      year={2025},
      eprint={2504.04953},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.04953}, 
}