🚀 opus-mt-tc-bible-big-roa-en
This is a neural machine translation model for translating from Romance languages (roa) to English (en). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world.
🚀 Quick Start
Short example code:

```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "É caro demais.",
    "Estamos muertos."
]

# Use the Hugging Face Hub id of the converted checkpoint.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     It's too expensive.
#     We're dead.
```
You can also use OPUS-MT models with the transformers pipelines, for example:
```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")
print(pipe("É caro demais."))

# expected output: It's too expensive.
```
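For larger batches it can also help to set generation options explicitly and to move the model to a GPU when one is available. The snippet below is a minimal sketch along those lines; the beam size and length limit are illustrative values, not settings recommended by the model authors.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Run on GPU if available; the model also works on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

src_text = [
    "É caro demais.",      # Portuguese
    "Estamos muertos.",    # Spanish
    "La vie est belle.",   # French
]

batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)

# Illustrative generation settings: beam search with a length cap.
generated = model.generate(**batch, num_beams=4, max_length=128)

for output_ids in generated:
    print(tokenizer.decode(output_ids, skip_special_tokens=True))
```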
✨ Key Features
- Translates from a wide range of Romance languages into English.
- Part of the OPUS-MT project; originally trained with the Marian NMT framework and converted to PyTorch models.
- Training data comes from OPUS, and the training pipeline follows the procedures of OPUS-MT-train.
📦 Installation
The original documentation does not list explicit installation steps. In practice, the examples in this card only require the Hugging Face transformers library (plus sentencepiece, which the Marian tokenizer depends on), e.g. `pip install transformers sentencepiece`.
📚 Documentation
Model Details
This is a neural machine translation model for translating from Romance languages (roa) to English (en).
This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with the [Marian NMT](https://marian-nmt.github.io/) framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipeline uses the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).
Attribute | Details |
---|---|
Developed by | Language Technology Research Group at the University of Helsinki |
Model type | Translation (transformer-big) |
Release | 2024-08-17 |
License | Apache-2.0 |
Source language(s) | acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa srd vec wln |
Target language(s) | eng |
Original model | [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip) |
Resources for more information | [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17), [OPUS-MT-train GitHub repo](https://github.com/Helsinki-NLP/OPUS-MT-train), more information about MarianNMT models in the transformers library, [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/), [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1), [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/) |
Uses
This model can be used for translation and text-to-text generation.
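Since the checkpoint is a standard encoder-decoder model, it can also be loaded through the generic text2text-generation pipeline in transformers; the following is a minimal sketch using the same Hub id as above.

```python
from transformers import pipeline

# The translation pipeline shown earlier is the usual entry point; the
# generic text2text-generation pipeline works for this seq2seq model too.
pipe = pipeline("text2text-generation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")
print(pipe("Estamos muertos.")[0]["generated_text"])
```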
Risks, Limitations and Biases
⚠️ Important note
Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).
Training
- Data: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
- Pre-processing: SentencePiece (spm32k,spm32k)
- Model type: transformer-big
- Original MarianNMT model: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
- Training scripts: [GitHub repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
Evaluation
- [Model scores on the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
- Test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt)
- Test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt)
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip
Language pair | Test set | chr-F | BLEU | #sentences | #words |
---|---|---|---|---|---|
multi-eng | tatoeba-test-v2020-07-28-v2023-09-26 | 0.76737 | 62.8 | 10000 | 87576 |
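The published figures come from the OPUS-MT evaluation setup linked above. Purely as an illustration of how such scores are computed, the sketch below scores a couple of model outputs with sacrebleu (BLEU and chrF); the reference sentences are made up for the example and are not taken from the released test set.

```python
import sacrebleu
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")

# Toy source/reference pairs; a real evaluation would use the released
# test set files listed above.
sources = ["É caro demais.", "La vie est belle."]
references = ["It's too expensive.", "Life is beautiful."]

hypotheses = [out["translation_text"] for out in pipe(sources)]

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)  # BLEU
print(sacrebleu.corpus_chrf(hypotheses, [references]).score)  # chrF
```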
Citation Information
- Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite if you use this model).
```bibtex
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
journal={Language Resources and Evaluation},
number={58},
pages={713--755},
year={2023},
publisher={Springer Nature},
issn={1574-0218},
doi={10.1007/s10579-023-09704-w}
}
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} {--} Building open translation services for the World",
author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
month = nov,
year = "2020",
address = "Lisboa, Portugal",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2020.eamt-1.61",
pages = "479--480",
}
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}
```
Acknowledgements
The work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC – IT Center for Science, Finland, and the [European high-performance computer LUMI](https://www.lumi-supercomputer.eu/).
Model Conversion Info
- transformers version: 4.45.1
- OPUS-MT git hash: 0882077
- Port time: Tue Oct 8 15:26:36 EEST 2024
- Port machine: LM0-400-22516.local
📄 License
This model is released under the Apache-2.0 license.



