opus-mt-tc-big-en-es: an open-source model for English-to-Spanish translation

Opus Mt Tc Big En Es

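Developed by Helsinki-NLP

An English-to-Spanish neural machine translation model from the OPUS-MT project, based on the transformer-big architecture

Machine translation

Transformers

#English-Spanish translation #High BLEU #Multi-domain

Downloads: 29.31k

Published: 4/13/2022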

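Model overview

This model is part of the OPUS-MT project and is dedicated to neural machine translation from English (en) to Spanish (es). It was trained with the Marian NMT framework and converted to PyTorch format using the transformers library.

Model features

High-quality translation

Across the ten reported test sets, BLEU scores range from 28.5 (flores101-devtest) to 57.2 (tatoeba-test-v2021-08-07)

Multi-domain applicability

Performs well on several text types, including news and everyday conversation

Open license

Released under the cc-by-4.0 license, which permits commercial and research use with attribution

Model capabilities

English-to-Spanish text translation

Multi-domain text processing

Batch translation

Use cases

Content localization

News translation

Translating English news content into Spanish

BLEU of 30.1 on the news-test2008 test set

Everyday conversation translation

Translating everyday conversation and simple sentences

BLEU of 57.2 on the tatoeba-test-v2021-08-07 test set

Education

Language learning aid

Providing translation references for language learners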

🚀 opus-mt-tc-big-en-es

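opus-mt-tc-big-en-es is a neural machine translation model for translating from English (en) to Spanish (es). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch format using the Hugging Face transformers library. Training data comes from OPUS, and the training pipeline follows the procedures of OPUS-MT-train.

Related publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite them if you use this model).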

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

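🚀 Quick start

This model translates English into Spanish; example code appears in the usage section below.

✨ Key features

Part of the OPUS-MT project, which aims to provide machine translation models for many languages.
Trained with the Marian NMT framework and later converted to PyTorch format.
Training data comes from OPUS, using the OPUS-MT-train training pipeline.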

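📦 Installation

The source model card lists no installation steps. The usage examples rely on the transformers library with PyTorch, and MarianTokenizer additionally requires the sentencepiece package.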

💻 Usage examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "A wasp stung him and he had an allergic reaction.",
    "I love nature."
]

# Hugging Face Hub ID of the model
model_name = "Helsinki-NLP/opus-mt-tc-big-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the whole batch with padding, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Una avispa lo picó y tuvo una reacción alérgica.
#     Me encanta la naturaleza.
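
For larger inputs, the batch translation capability mentioned earlier could be sketched roughly as follows. The helper names `batched`, `translate_in_batches`, and `make_marian_translator` are hypothetical and not part of the model card or of transformers; the default batch size of 8 is an arbitrary assumption. Only `make_marian_translator` touches the model, and it imports transformers lazily so the batching logic can be tried with any callable.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    texts: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        results.extend(translate_fn(batch))
    return results


def make_marian_translator(model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-es"):
    """Build a translate_fn backed by the model (downloads weights on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate
```

With the model available, a call such as `translate_in_batches(src_text, make_marian_translator(), batch_size=2)` would return the translations in the same order as the inputs.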

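Pipeline usage

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-es")
# The pipeline returns a list of dicts, one per input, each with a "translation_text" key
result = pipe("A wasp stung him and he had an allergic reaction.")
print(result[0]["translation_text"])

# expected output: Una avispa lo picó y tuvo una reacción alérgica.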
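
The benchmarks reported for this model are sentence-level test sets, so for longer documents (for example a news article) one reasonable approach is to split the text into sentences before translating. The sketch below is an assumption, not something the model card prescribes: `split_sentences` and `translate_document` are hypothetical helpers, and the regular expression is deliberately naive (it will wrongly split after abbreviations such as "Dr.").

```python
import re
from typing import Callable, List

# Split after ., ! or ? when followed by whitespace and an uppercase letter,
# an opening quote, or an opening bracket.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(\[])")


def split_sentences(text: str) -> List[str]:
    """Naively split English text into sentences."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [p for p in parts if p]


def translate_document(
    text: str,
    translate_fn: Callable[[List[str]], List[str]],
) -> str:
    """Translate a document sentence by sentence and rejoin the output with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    return " ".join(translate_fn(sentences))
```

Here `translate_fn` is any callable that maps a list of English sentences to a list of Spanish sentences, such as a thin wrapper around `model.generate` or the translation pipeline. A dedicated sentence segmenter would be more robust than this regular expression.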

📚 Detailed documentation

Model information

Property	Details
Release date	2022-03-13
Source language	English (eng)
Target language	Spanish (spa)
Model type	transformer-big
Training data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k, spm32k)
Original model	[opusTCv20210807+bt_transformer-big_2022-03-13.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opusTCv20210807+bt_transformer-big_2022-03-13.zip)
More information	[OPUS-MT eng-spa README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-spa/README.md)

Benchmarks

Test set translations: [opusTCv20210807+bt_transformer-big_2022-03-13.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opusTCv20210807+bt_transformer-big_2022-03-13.test.txt)
Test set scores: [opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-spa/opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
eng-spa	tatoeba-test-v2021-08-07	0.73863	57.2	16583	134710
eng-spa	flores101-devtest	0.56440	28.5	1012	29199
eng-spa	newssyscomb2009	0.58415	31.5	502	12503
eng-spa	news-test2008	0.56707	30.1	2051	52586
eng-spa	newstest2009	0.57836	30.2	2525	68111
eng-spa	newstest2010	0.62357	37.6	2489	65480
eng-spa	newstest2011	0.62415	38.9	3003	79476
eng-spa	newstest2012	0.63031	39.5	3003	79006
eng-spa	newstest2013	0.60354	35.9	3000	70528
eng-spa	tico19-test	0.73554	53.0	2100	66563
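
As a quick sanity check on the table above, the rows can be loaded into a small Python structure to recover the BLEU range quoted in the feature summary and the spread across the news test sets. This is only an illustrative sketch; the variable names are arbitrary, and the values are copied by hand from the table.

```python
# (test set, chr-F, BLEU, sentences, words), copied from the benchmark table
RESULTS = [
    ("tatoeba-test-v2021-08-07", 0.73863, 57.2, 16583, 134710),
    ("flores101-devtest", 0.56440, 28.5, 1012, 29199),
    ("newssyscomb2009", 0.58415, 31.5, 502, 12503),
    ("news-test2008", 0.56707, 30.1, 2051, 52586),
    ("newstest2009", 0.57836, 30.2, 2525, 68111),
    ("newstest2010", 0.62357, 37.6, 2489, 65480),
    ("newstest2011", 0.62415, 38.9, 3003, 79476),
    ("newstest2012", 0.63031, 39.5, 3003, 79006),
    ("newstest2013", 0.60354, 35.9, 3000, 70528),
    ("tico19-test", 0.73554, 53.0, 2100, 66563),
]

# Highest and lowest BLEU over all test sets
best = max(RESULTS, key=lambda row: row[2])
worst = min(RESULTS, key=lambda row: row[2])

# BLEU spread restricted to the news-domain test sets
news_bleu = [row[2] for row in RESULTS if row[0].startswith("news")]
news_min, news_max = min(news_bleu), max(news_bleu)

print(f"overall BLEU range: {worst[2]} ({worst[0]}) to {best[2]} ({best[0]})")
print(f"news BLEU range: {news_min} to {news_max}")
```

The gap between the Tatoeba score and the news scores suggests that the headline figure reflects short, simple sentences, so the news-domain range may be the more useful reference when planning news or article translation.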

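🔧 Technical details

The source documentation does not describe implementation details beyond the model information above.

📄 License

This model is released under the cc-by-4.0 license.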

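Acknowledgements

This work was supported by the following projects:

[European Language Grid](https://www.european-language-grid.eu/), as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866).
[FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
MeMAD project, funded under the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful for the generous computational resources and IT infrastructure provided by CSC (IT Center for Science, Finland).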