opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa: an open-source translation model from Romance languages into German, English, French, Portuguese, and Spanish

Opus Mt Tc Bible Big Roa Deu Eng Fra Por Spa

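Developed by Helsinki-NLP

A neural machine translation model with multiple target languages, built to translate from a range of Romance languages into German, English, French, Portuguese, and Spanish.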

Machine translation

Transformers

Multilingual · License: Apache-2.0 · #multilingual-bible-translation #romance-language-support #BLEU-55.6

Downloads: 25

Published: 10/8/2024

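Model overview

This model is part of the OPUS-MT project, which aims to make neural machine translation models widely available for many of the world's languages. It translates from Romance languages such as Antillean Creole and Aragonese into German, English, French, Portuguese, and Spanish.

Model features

Multiple target languages

Translates from many Romance languages into German, English, French, Portuguese, and Spanish.

Translation quality

Scores BLEU 55.6 and chr-F 0.73367 on the tatoeba-test-v2020-07-28-v2023-09-26 test set.

Broad language coverage

Covers 42 source languages and 5 target languages, including many Romance languages and creoles.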

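Model capabilities

Text translation

Multilingual support

Neural machine translation

Use cases

Language translation

Multilingual document translation

Translate documents written in Romance languages into German, English, French, Portuguese, or Spanish.

High-quality translations suitable for business, educational, and research use.

Cross-lingual communication

Helps users communicate across languages in real time.

Fast, accurate translation that makes communication more efficient.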

🚀 opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa

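This is a neural machine translation model for translating from Romance languages into German, English, French, Portuguese, and Spanish. It belongs to the OPUS-MT project, was trained with the Marian NMT framework, and was converted to a PyTorch model with Hugging Face's transformers library.

🚀 Quick Start

Code example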

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with a >>id<< token naming the target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Hugging Face Hub repository ID (the same one the pipeline example uses).
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize with padding so both sentences fit in one batch, then generate.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also run OPUS-MT models through a transformers pipeline, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
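For the document-translation use case, the same calls can be wrapped in a loop that processes sentences in fixed-size batches. The following is a minimal sketch, not part of the original model card: the helper names `batched` and `translate_all`, the default batch size of 8, and the use of `truncation=True` are all assumptions chosen for illustration.

```python
def batched(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(sentences, target="eng", batch_size=8,
                  model_name="Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa"):
    """Translate a list of sentences into one target language (hypothetical helper)."""
    # Imported lazily so that `batched` can be used without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    results = []
    for batch in batched(sentences, batch_size):
        # Prefix every sentence with the target-language token.
        tagged = [f">>{target}<< {sentence}" for sentence in batch]
        inputs = tokenizer(tagged, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

Calling `translate_all(["Bonjour le monde.", "Buongiorno a tutti."], target="deu")` downloads the model on first use and returns one German string per input sentence. A smaller batch size lowers peak memory at the cost of throughput.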

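✨ Key Features

Translates from many Romance languages into German, English, French, Portuguese, and Spanish.
Part of the OPUS-MT project, which aims to make neural machine translation models widely available.
Trained with Marian NMT, an efficient NMT framework written in pure C++.
Converted to a PyTorch model with the transformers library for ease of use.

📦 Installation

The model card gives no specific installation steps; see the official transformers documentation. Besides transformers itself, the examples here need PyTorch (for `return_tensors="pt"`) and SentencePiece, which `MarianTokenizer` depends on.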
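Before loading the model, it can help to confirm that the packages the examples import are present. The sketch below uses only the standard library; the package list is an assumption based on what the examples in this card use, not an official requirements file, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed requirement list for the examples in this card, not an official one.
REQUIRED = ("transformers", "torch", "sentencepiece")


def missing_packages(names=REQUIRED):
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All packages needed by the examples are importable.")
```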

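📚 Documentation

Model details

This is a neural machine translation model for translating from Romance languages (roa) into German, English, French, Portuguese, and Spanish.

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and then converted to PyTorch with Hugging Face's transformers library. Training data comes from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model description: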

Property	Details
Developer	Language Technology Research Group at the University of Helsinki
Model type	Translation (transformer-big)
Release date	2024-05-30
License	Apache-2.0
Source languages	acf, arg, ast, cat, cbk, cos, crs, egl, ext, fra, frm, fro, frp, fur, gcf, glg, hat, ita, kea, lad, lij, lld, lmo, lou, mfe, mol, mwl, nap, oci, osp, pap, pcd, pms, por, roh, ron, rup, scn, spa, srd, vec, wln
Target languages	deu, eng, fra, por, spa
Valid target language labels	>>deu<<, >>eng<<, >>fra<<, >>por<<, >>spa<<, >>xxx<<
Original model	[opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
More information	[OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30), [OPUS-MT-train GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train), more information about MarianNMT models in the transformers library, [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/), [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1), [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)

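This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, for example >>deu<<.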
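The tagging rule is easy to get wrong by hand, so a small helper can add and validate the token. This is a minimal sketch: `with_target_tag` and `fan_out` are hypothetical names, and the allowed set covers only the five language IDs listed as target languages (the `>>xxx<<` label from the table is deliberately left out because the card does not explain it).

```python
# Target language IDs taken from the model description table.
VALID_TARGETS = ("deu", "eng", "fra", "por", "spa")


def with_target_tag(text, target):
    """Prefix `text` with the >>id<< token for `target`, rejecting unknown IDs."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unsupported target language: {target!r}")
    return f">>{target}<< {text.strip()}"


def fan_out(text, targets=VALID_TARGETS):
    """Build one tagged input per target language for the same source sentence."""
    return [with_target_tag(text, target) for target in targets]


print(with_target_tag("Bonjour le monde.", "deu"))  # >>deu<< Bonjour le monde.
```

The list returned by `fan_out` can be passed straight to the tokenizer or pipeline to translate one sentence into all five target languages in a single batch.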

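Uses

This model can be used for translation and text-to-text generation.

Risks, limitations, and biases

⚠️ Important

Readers should be aware that the model was trained on various public datasets that may contain disturbing or offensive content and may propagate historical and current stereotypes.

A substantial body of research has examined bias and fairness in language models (see, for example, [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).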

Training

Data: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
Pre-processing: SentencePiece (spm32k, spm32k)
Model type: transformer-big
Original MarianNMT model: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
Training scripts: [GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train)

Evaluation

[Model scores on the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30)
Test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt)
Test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.73367	55.6	10000	83852
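Note the different scales in the table: chr-F is reported between 0 and 1, BLEU between 0 and 100. To give a feel for what chr-F measures, here is a simplified sketch of a character n-gram F-score: precision and recall are averaged over n-gram orders 1 to 6 and combined with β = 2, so recall counts more than precision. The function names are hypothetical, whitespace is simply dropped, and the result will not exactly match standard tools such as sacreBLEU, so it should not be used to reproduce the reported scores.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chr-F in [0, 1]: F-beta of order-averaged precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


print(simple_chrf("Hallo Welt", "Hallo Welt"))  # 1.0 for an exact match
```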

Citation information

Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/), and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/). Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

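Acknowledgements

This work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).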