opus-mt-tc-big-lt-en: an open-source model for Lithuanian-to-English translation

Opus Mt Tc Big Lt En

Developed by Helsinki-NLP

A neural machine translation model for translating from Lithuanian to English, part of the OPUS-MT project.

Machine translation

Transformers

#Lithuanian-English translation #high-accuracy machine translation

Downloads: 312

Published: 4/13/2022

Model overview

The model is built specifically for Lithuanian-to-English translation. It uses the transformer-big architecture and a SentencePiece tokenizer.

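Model features

Trained on multiple datasets

The model is trained on several datasets, including OPUS and Tatoeba-Challenge data.

High translation quality

BLEU scores on the reported test sets range from 32.3 to 61.6.

SentencePiece tokenization

Source and target text are segmented into subword units with spm32k SentencePiece models.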

Model capabilities

Lithuanian-to-English text translation

Long-text translation

Batch translation

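Use cases

Text translation

Everyday phrases

Translates everyday Lithuanian phrases into English.

Reaches a BLEU score of 61.6 on the Tatoeba test set

News translation

Translates Lithuanian news content into English.

Reaches a BLEU score of 32.3 on the newstest2019 test set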

🚀 opus-mt-tc-big-lt-en

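A neural machine translation model for translating from Lithuanian (lt) to English (en). The model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the transformers library by Hugging Face. The training data comes from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Related publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite them if you use this model).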

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

🚀 Quick start

Model information

Property	Details
Release date	2022-02-25
Source language	Lithuanian (lit)
Target language	English (eng)
Model type	transformer-big
Training data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original model	opusTCv20210807+bt_transformer-big_2022-02-25.zip
More information	OPUS-MT lit-eng README

License

This model is released under the cc-by-4.0 license.

Supported languages

This model translates from Lithuanian (lt) to English (en).

💻 Usage examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

# Lithuanian source sentences to translate
src_text = [
    "Katė sedėjo ant kėdės.",
    "Jukiko mėgsta bulves."
]

# Model ID on the Hugging Face Hub
model_name = "Helsinki-NLP/opus-mt-tc-big-lt-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Expected output:
#     The cat sat on a chair.
#     Yukiko likes potatoes.
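
For larger inputs, the same calls can be applied one batch at a time. The following is a minimal sketch rather than part of the official model card: `batched`, `translate_in_batches`, and `make_marian_translator` are hypothetical helper names, the default batch size of 8 is an arbitrary assumption, and `truncation=True` is added on the assumption that over-long inputs should be cut rather than raise an error.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-lt-en",
) -> Callable[[List[str]], List[str]]:
    """Wrap the tokenizer and model calls from the basic example (hypothetical helper)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch; truncate over-long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

With the model downloaded, `translate_in_batches(src_text, make_marian_translator(), batch_size=2)` would return the translations in input order. Passing the translator in as a function keeps the batching logic testable without loading the model.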

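Advanced usage

You can also use OPUS-MT models through the transformers pipeline, for example:

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-lt-en")
# The pipeline returns a list with one dict per input; the text is under "translation_text"
print(pipe("Katė sedėjo ant kėdės.")[0]["translation_text"])

# Expected output: The cat sat on a chair.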
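
For longer passages, one common approach is to split the text into sentences and translate them as a list, since models of this kind are generally trained on sentence-level data and accept a limited input length. The sketch below is an illustration under those assumptions: `split_sentences` and `translate_long_text` are hypothetical helpers, and the regular expression is a rough heuristic that will mis-split abbreviations and similar cases.

```python
import re
from typing import Callable, List

# Rough heuristic: split on whitespace that follows ".", "!" or "?".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split `text` into sentence-like chunks, dropping empty pieces."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


def translate_long_text(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],
) -> str:
    """Translate `text` sentence by sentence and join the results with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ""  # nothing to translate, so the translator is never called
    return " ".join(translate_batch(sentences))
```

With a pipeline object such as `pipe` from the example above, the translator argument could be `lambda batch: [r["translation_text"] for r in pipe(batch)]`, which relies on the pipeline accepting a list of strings and returning one dict per input.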

📊 Benchmarks

Test set translations: opusTCv20210807+bt_transformer-big_2022-02-25.test.txt
Test set scores: opusTCv20210807+bt_transformer-big_2022-02-25.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
lit-eng	tatoeba-test-v2021-08-07	0.74881	61.6	2528	17855
lit-eng	flores101-devtest	0.60662	34.3	1012	24721
lit-eng	newsdev2019	0.59995	32.9	2000	49312
lit-eng	newstest2019	0.61742	32.3	1000	25878
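
To give a feel for what the chr-F column measures, here is a simplified, sentence-level sketch of a character n-gram F-score. It is not the evaluation script used for the table and will not reproduce those numbers: the function names are hypothetical, whitespace is simply removed, and the n-gram orders 1 to 6 and beta = 2 are assumed defaults.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order `n`, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Return a simplified chrF-style score between 0 and 1."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

An identical hypothesis and reference score 1.0, and strings with no characters in common score 0.0; a near match such as `simple_chrf("The cat sat on a chair.", "The cat sat on the chair.")` falls in between.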

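🙏 Acknowledgements

This work is supported by the following projects:

[European Language Grid](https://www.european-language-grid.eu/), as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866).
The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful to CSC, the IT Center for Science in Finland, for its generous computational resources and IT infrastructure.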