opus-mt-tc-big-lt-en: a free, open-source model for Lithuanian-to-English translation

Opus Mt Tc Big Lt En

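Developed by Helsinki-NLP

A neural machine translation model for translating from Lithuanian to English, part of the OPUS-MT project.

Machine translation

Transformers

Tags: #Lithuanian-English translation #high-accuracy machine translation #multilingual support

Downloads: 312

Published: 2022-04-13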

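Model overview

This model is dedicated to Lithuanian-to-English translation. It is based on the transformer-big architecture and uses SentencePiece tokenization.

Model features

Trained on multiple data sources

The training data comes from several sources, including OPUS and the Tatoeba-Challenge data.

Strong translation quality

BLEU scores on the reported test sets range from 32.3 (newstest2019) to 61.6 (Tatoeba).

SentencePiece tokenization

Text is segmented into subword units with spm32k SentencePiece models.

Model capabilities

Lithuanian-to-English text translation

Long-text translation

Batch translation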
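Long inputs are usually easier to handle when they are split into sentence-sized pieces before translation. The following is a minimal sketch of that preprocessing step, not something the model card prescribes: the helper names `split_sentences` and `chunk_sentences`, the punctuation-based splitting rule, and the `max_chars` limit of 400 are all illustrative assumptions.

```python
import re
from typing import List

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This rule is an illustrative assumption, not part of the model card.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation rule."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_sentences(sentences: List[str], max_chars: int = 400) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be passed to the model as one input string. A real deployment would probably use a proper sentence segmenter and a limit measured in tokens rather than characters.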

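Use cases

Text translation

Everyday-language translation

Translates everyday Lithuanian phrases into English.

Scores 61.6 BLEU on the Tatoeba test set.

News translation

Translates Lithuanian news text into English.

Scores 32.3 BLEU on the newstest2019 test set.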

🚀 opus-mt-tc-big-lt-en

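A neural machine translation model for translating from Lithuanian (lt) to English (en). The model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines use the procedures of OPUS-MT-train.

Publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite them if you use this model).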

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

🚀 Quick start

Model information

Property	Details
Release	2022-02-25
Source language	Lithuanian (lit)
Target language	English (eng)
Model type	transformer-big
Training data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original model	opusTCv20210807+bt_transformer-big_2022-02-25.zip
More information	OPUS-MT lit-eng README

License

This model is released under the CC-BY-4.0 license.

Supported languages

The model translates in one direction, from Lithuanian (lt) to English (en).

💻 Usage examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

# Lithuanian source sentences to translate
src_text = [
    "Katė sedėjo ant kėdės.",
    "Jukiko mėgsta bulves."
]

# Hub ID of the model, matching the pipeline example in this document
model_name = "Helsinki-NLP/opus-mt-tc-big-lt-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     The cat sat on a chair.
#     Yukiko likes potatoes.
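For larger inputs it can help to feed the model fixed-size batches rather than one large list. The sketch below shows one way to do that under stated assumptions: `batched`, `translate_in_batches`, and `make_marian_translator` are hypothetical helper names, the default batch size of 8 is arbitrary, and `truncation=True` is added as a precaution rather than taken from the model card. The translation step is passed in as a callable so that the batching logic can be tried without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate sentences batch by batch, preserving input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-lt-en",
) -> Callable[[List[str]], List[str]]:
    """Build a translate_batch callable backed by MarianMTModel."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch; truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

A typical call would be `translate_in_batches(src_text, make_marian_translator(), batch_size=16)`. Smaller batches lower peak memory use at the cost of more forward passes.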

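Advanced usage

You can also use OPUS-MT models with the transformers pipeline, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-lt-en")
# The pipeline returns a list of dicts; the text is under "translation_text"
print(pipe("Katė sedėjo ant kėdės.")[0]["translation_text"])

# expected output: The cat sat on a chair.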

📊 Benchmarks

Test set translations: opusTCv20210807+bt_transformer-big_2022-02-25.test.txt
Test set scores: opusTCv20210807+bt_transformer-big_2022-02-25.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
lit-eng	tatoeba-test-v2021-08-07	0.74881	61.6	2528	17855
lit-eng	flores101-devtest	0.60662	34.3	1012	24721
lit-eng	newsdev2019	0.59995	32.9	2000	49312
lit-eng	newstest2019	0.61742	32.3	1000	25878
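The sentence and word counts in the table give a rough sense of how different the test sets are. As a small worked example, the sketch below computes the average number of words per sentence for each test set. The figures are copied from the table; the names `BENCHMARKS` and `words_per_sentence` are illustrative, and the interpretation that shorter sentences may partly explain the higher Tatoeba score is a plausible reading, not a claim made by the model card.

```python
from typing import Dict, List, Tuple

# (test set, chr-F, BLEU, sentences, words), copied from the benchmark table.
BENCHMARKS: List[Tuple[str, float, float, int, int]] = [
    ("tatoeba-test-v2021-08-07", 0.74881, 61.6, 2528, 17855),
    ("flores101-devtest", 0.60662, 34.3, 1012, 24721),
    ("newsdev2019", 0.59995, 32.9, 2000, 49312),
    ("newstest2019", 0.61742, 32.3, 1000, 25878),
]


def words_per_sentence(
    rows: List[Tuple[str, float, float, int, int]],
) -> Dict[str, float]:
    """Return the average number of words per sentence for each test set."""
    return {
        name: words / sentences
        for name, _chrf, _bleu, sentences, words in rows
    }
```

With these numbers, the Tatoeba test set averages about 7 words per sentence, while the other three average roughly 24 to 26, so the scores are measured on quite different kinds of text.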

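🙏 Acknowledgements

This work is supported by the following projects:

[Pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866) of the [European Language Grid](https://www.european-language-grid.eu/).
The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful for the generous computational resources and IT infrastructure provided by CSC (IT Center for Science, Finland).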