opus-mt-tc-big-en-zle open-source translation model: free English-to-Belarusian, Russian, and Ukrainian translation

Opus Mt Tc Big En Zle

Developed by Helsinki-NLP

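A neural machine translation model for translating from English into the East Slavic languages (Belarusian, Russian, and Ukrainian). It is part of the OPUS-MT project.

Machine Translation

Transformers

Supports Multiple Languages #Multilingual translation #East Slavic languages #High BLEU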

Downloads 565

Release Time : 3/24/2022

Model Overview

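The model targets translation from English into the East Slavic languages and supports Belarusian, Russian, and Ukrainian. It uses the transformer-big architecture and is trained on data from the OPUS corpus.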

Model Features

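Multilingual support

Translates from English into three East Slavic languages (Belarusian, Russian, and Ukrainian).

High translation quality

Performs well on multiple test sets; for example, English-to-Russian translation reaches a BLEU score of 45.5 on the Tatoeba test set.

Transformer-based architecture

Uses the transformer-big architecture.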

Model Capabilities

English-to-Belarusian translation

English-to-Russian translation

English-to-Ukrainian translation

Multilingual machine translation

Use Cases

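Text translation

Everyday phrase translation

Translates everyday English phrases into the East Slavic languages.

BLEU scores of 24.9 (Belarusian), 45.5 (Russian), and 37.7 (Ukrainian) on the Tatoeba test set

News translation

Translates English news content into Russian and the other supported languages.

BLEU score of 43.5 on newstest2014 (English to Russian)

Cross-lingual communication

Multilingual communication aid

Helps English speakers communicate with speakers of East Slavic languages.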

🚀 opus-mt-tc-big-en-zle

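A neural machine translation model for translating from English (en) into East Slavic languages (zle). The model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipeline follows the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite them if you use this model).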

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

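✨ Key Features

Translates from English into several East Slavic languages.
Is widely available and accessible as part of the OPUS-MT project.
Is trained with the efficient Marian NMT framework and converted to PyTorch format.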

💻 Usage Examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

# Each input sentence starts with a target-language token (here >>rus<< for Russian).
src_text = [
    ">>rus<< Are they coming as well?",
    ">>rus<< I didn't let Tom do what he wanted to do."
]

# Hugging Face Hub ID of the converted model.
model_name = "Helsinki-NLP/opus-mt-tc-big-en-zle"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Они тоже приедут?
#     Я не позволил Тому сделать то, что он хотел.

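Advanced usage

from transformers import pipeline

# The translation pipeline wraps tokenization, generation, and decoding.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-zle")
# The pipeline returns a list of dicts; print only the translated text.
print(pipe(">>rus<< Are they coming as well?")[0]["translation_text"])

# expected output: Они тоже приедут?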
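
The basic example above translates two sentences in a single call. For longer inputs it can help to split the list into fixed-size batches. The following is a minimal sketch of that idea, not part of the original model card: the helper names `batched`, `translate_in_batches`, and `make_marian_translator`, and the default batch size of 8, are illustrative assumptions. The model is loaded lazily inside `make_marian_translator`, so the batching helpers can be used and tested without downloading anything.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # illustrative default, not a recommendation from the model card
) -> List[str]:
    """Apply `translate_batch` to each chunk and concatenate the results in order."""
    results: List[str] = []
    for chunk in batched(sentences, batch_size):
        results.extend(translate_batch(chunk))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-zle",
) -> Callable[[List[str]], List[str]]:
    """Build a batch-translation function backed by the Marian model (hypothetical helper)."""
    # Imported here so the helpers above work even if transformers is not installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch and truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

To use it with the real model, pass the result of `make_marian_translator()` as `translate_batch`. Each sentence still needs its target-language token, for example `>>rus<<`, at the start.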

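📚 Detailed Documentation

Model information

| Property | Details |
|---|---|
| Model type | transformer-big |
| Training data | opusTCv20210807+bt ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge)) |
| Release date | 2022-03-13 |
| Source language | English (eng) |
| Target languages | Belarusian (bel), Russian (rus), Ukrainian (ukr) |
| Valid target language labels | >>bel<< >>rus<< >>ukr<< |
| Original model | [opusTCv20210807+bt_transformer-big_2022-03-13.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-zle/opusTCv20210807+bt_transformer-big_2022-03-13.zip) |
| More information on released models | [OPUS-MT eng-zle README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-zle/README.md) |
| More information about the model | MarianMT |

This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, for example >>bel<<.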
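
The token requirement described above can be enforced with a small helper before text reaches the tokenizer. This is a minimal sketch: the names `VALID_TARGETS`, `add_target_token`, and `fan_out` are illustrative and not part of the model or the transformers library. Only the three language IDs and the `>>id<<` format come from the model card.

```python
from typing import List

# Target language IDs listed as valid for this model.
VALID_TARGETS = ("bel", "rus", "ukr")


def add_target_token(sentence: str, target: str) -> str:
    """Prefix `sentence` with the >>id<< token for `target`."""
    if target not in VALID_TARGETS:
        raise ValueError(
            f"unsupported target language {target!r}; expected one of {VALID_TARGETS}"
        )
    return f">>{target}<< {sentence.strip()}"


def fan_out(sentence: str) -> List[str]:
    """Return one tagged copy of `sentence` per supported target language."""
    return [add_target_token(sentence, target) for target in VALID_TARGETS]


print(add_target_token("Are they coming as well?", "rus"))
# prints: >>rus<< Are they coming as well?
```

The list returned by `fan_out` can be passed to the tokenizer as a single batch, so that one English sentence is translated into all three languages in one `generate` call.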

Benchmarks

| Language pair | Test set | chrF | BLEU | Sentences | Words |
|---|---|---|---|---|---|
| eng-bel | tatoeba-test-v2021-08-07 | 0.50345 | 24.9 | 2500 | 16237 |
| eng-rus | tatoeba-test-v2021-08-07 | 0.66182 | 45.5 | 19425 | 134296 |
| eng-ukr | tatoeba-test-v2021-08-07 | 0.60175 | 37.7 | 13127 | 80998 |
| eng-bel | flores101-devtest | 0.42078 | 11.2 | 1012 | 24829 |
| eng-rus | flores101-devtest | 0.59654 | 32.7 | 1012 | 23295 |
| eng-ukr | flores101-devtest | 0.60131 | 32.1 | 1012 | 22810 |
| eng-rus | newstest2012 | 0.62842 | 36.8 | 3003 | 64790 |
| eng-rus | newstest2013 | 0.54627 | 26.9 | 3000 | 58560 |
| eng-rus | newstest2014 | 0.68348 | 43.5 | 3003 | 61603 |
| eng-rus | newstest2015 | 0.62621 | 34.9 | 2818 | 55915 |
| eng-rus | newstest2016 | 0.60595 | 33.1 | 2998 | 62014 |
| eng-rus | newstest2017 | 0.64249 | 37.3 | 3001 | 60253 |
| eng-rus | newstest2018 | 0.61219 | 32.9 | 3000 | 61907 |
| eng-rus | newstest2019 | 0.57902 | 31.8 | 1997 | 48147 |
| eng-rus | newstest2020 | 0.52939 | 25.5 | 2002 | 47083 |
| eng-rus | tico19-test | 0.59314 | 33.7 | 2100 | 55843 |

Test set translations: [opusTCv20210807+bt_transformer-big_2022-03-13.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-zle/opusTCv20210807+bt_transformer-big_2022-03-13.test.txt)
Test set scores: [opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-zle/opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip
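
For readers unfamiliar with the chrF column in the benchmark table: it is a character n-gram F-score on a 0 to 1 scale. It combines character n-gram precision and recall, with recall weighted more heavily. The sketch below is a simplified single-sentence illustration under stated assumptions (n-gram orders 1 to 6, beta = 2, whitespace removed before counting, n-gram orders that are too long for either string skipped). The function names are hypothetical. It is not the evaluation script behind the reported scores and should not be expected to reproduce them, since standard tools handle edge cases and corpus-level aggregation differently.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of `text`, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Identical strings match on every n-gram, so the score is 1.0.
print(simple_chrf("Они тоже приедут?", "Они тоже приедут?"))
```

Because chrF works on characters rather than whole words, a partially correct word form still earns partial credit, which makes it a useful complement to BLEU for morphologically rich target languages such as Belarusian, Russian, and Ukrainian.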

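Acknowledgements

This work is supported by:

[Pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866) of the [European Language Grid](https://www.european-language-grid.eu/).
The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 771113).
The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No. 780069).

We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.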