🚀 opus-mt-tc-bible-big-roa-en
This is a neural machine translation model for translating from Romance languages (roa) to English (en). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world.
🚀 Quick Start
Short example code:

```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "É caro demais.",
    "Estamos muertos."
]

# Use the Hugging Face Hub id of the converted checkpoint.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     It's too expensive.
#     We're dead.
```
You can also use OPUS-MT models with the transformers pipelines, for example:
```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")
print(pipe("É caro demais."))

# expected output: It's too expensive.
```
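For larger batches it can also help to set generation options explicitly and to move the model to a GPU when one is available. The snippet below is a minimal sketch along those lines; the beam size and length limit are illustrative values, not settings recommended by the model authors.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Run on GPU if available; the model also works on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

src_text = [
    "É caro demais.",      # Portuguese
    "Estamos muertos.",    # Spanish
    "La vie est belle.",   # French
]

batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)

# Illustrative generation settings: beam search with a length cap.
generated = model.generate(**batch, num_beams=4, max_length=128)

for output_ids in generated:
    print(tokenizer.decode(output_ids, skip_special_tokens=True))
```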
✨ Key Features
- Translates from a wide range of Romance languages into English.
- Part of the OPUS-MT project; originally trained with the Marian NMT framework and converted to PyTorch models.
- Training data comes from OPUS, and the training pipeline follows the procedures of OPUS-MT-train.
📦 Installation
The original documentation does not list explicit installation steps. In practice, the examples in this card only require the Hugging Face transformers library (plus sentencepiece, which the Marian tokenizer depends on), e.g. `pip install transformers sentencepiece`.
📚 Documentation
Model Details
This is a neural machine translation model for translating from Romance languages (roa) to English (en).
This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with the [Marian NMT](https://marian-nmt.github.io/) framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipeline uses the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).
Attribute | Details |
---|---|
Developed by | Language Technology Research Group at the University of Helsinki |
Model type | Translation (transformer-big) |
Release | 2024-08-17 |
License | Apache-2.0 |
Source language(s) | acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa srd vec wln |
Target language(s) | eng |
Original model | [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip) |
Resources for more information | [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17), [OPUS-MT-train GitHub repo](https://github.com/Helsinki-NLP/OPUS-MT-train), more information about MarianNMT models in the transformers library, [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/), [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1), [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/) |
Uses
This model can be used for translation and text-to-text generation.
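Since the checkpoint is a standard encoder-decoder model, it can also be loaded through the generic text2text-generation pipeline in transformers; the following is a minimal sketch using the same Hub id as above.

```python
from transformers import pipeline

# The translation pipeline shown earlier is the usual entry point; the
# generic text2text-generation pipeline works for this seq2seq model too.
pipe = pipeline("text2text-generation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")
print(pipe("Estamos muertos.")[0]["generated_text"])
```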
Risks, Limitations and Biases
⚠️ Important note
Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).
Training
- Data: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
- Pre-processing: SentencePiece (spm32k,spm32k)
- Model type: transformer-big
- Original MarianNMT model: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
- Training scripts: [GitHub repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
Evaluation
- [Model scores on the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
- Test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt)
- Test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt)
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip
Language pair | Test set | chr-F | BLEU | #sentences | #words |
---|---|---|---|---|---|
multi-eng | tatoeba-test-v2020-07-28-v2023-09-26 | 0.76737 | 62.8 | 10000 | 87576 |
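The published figures come from the OPUS-MT evaluation setup linked above. Purely as an illustration of how such scores are computed, the sketch below scores a couple of model outputs with sacrebleu (BLEU and chrF); the reference sentences are made up for the example and are not taken from the released test set.

```python
import sacrebleu
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")

# Toy source/reference pairs; a real evaluation would use the released
# test set files listed above.
sources = ["É caro demais.", "La vie est belle."]
references = ["It's too expensive.", "Life is beautiful."]

hypotheses = [out["translation_text"] for out in pipe(sources)]

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)  # BLEU
print(sacrebleu.corpus_chrf(hypotheses, [references]).score)  # chrF
```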
Citation Information
- Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite if you use this model).
```bibtex
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
journal={Language Resources and Evaluation},
number={58},
pages={713--755},
year={2023},
publisher={Springer Nature},
issn={1574-0218},
doi={10.1007/s10579-023-09704-w}
}
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} {--} Building open translation services for the World",
author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
month = nov,
year = "2020",
address = "Lisboa, Portugal",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2020.eamt-1.61",
pages = "479--480",
}
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}
```
Acknowledgements
The work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC – IT Center for Science, Finland, and the [European high-performance computer LUMI](https://www.lumi-supercomputer.eu/).
Model Conversion Info
- transformers version: 4.45.1
- OPUS-MT git hash: 0882077
- Port time: Tue Oct 8 15:26:36 EEST 2024
- Port machine: LM0-400-22516.local
📄 License
This model is released under the Apache-2.0 license.



