opus-mt-tc-big-en-pt: an open-source, free-to-use English-to-Portuguese translation model

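Developed by Helsinki-NLP

This is a neural machine translation model for translating English into Portuguese (including Brazilian Portuguese). It is part of the OPUS-MT project.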

Task: machine translation

Library: Transformers

Tags: #English-to-Portuguese translation #multiple Portuguese variants #high BLEU score

Downloads: 65.51k

Published: 4/13/2022

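Model overview

The model translates English text into Portuguese and supports both the Brazilian and the European variant. It is based on the transformer-big architecture and uses SentencePiece for tokenization.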

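Model features

Multiple target languages

Translates into Brazilian Portuguese and European Portuguese. The target is selected by prefixing the input with a target-language tag such as >>por<<.

Translation quality

Reaches BLEU scores of 50.4 on flores101-devtest and 49.6 on tatoeba-test-v2021-08-07.

Open license

Released under cc-by-4.0, which permits commercial and research use with attribution.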

Model capabilities

Text translation from English to Portuguese

Support for the Brazilian and European Portuguese variants

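Use cases

Content localization

Website content translation

Translate English website content into Portuguese for the Brazilian or Portuguese market.

Reported quality: BLEU 50.4 on flores101-devtest

Document translation

Business document translation

Translate English business contracts or reports into Portuguese.

Intended to keep specialist terminology accurate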

🚀 opus-mt-tc-big-en-pt

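This is a neural machine translation model for translating from English (en) to Portuguese (pt). It scores 49.6 BLEU on tatoeba-test-v2021-08-07 and 50.4 BLEU on flores101-devtest.

🚀 Quick start

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by huggingface. The training data is taken from OPUS, and the training pipelines use the procedures of OPUS-MT-train.

Related publications: OPUS-MT – Building open translation services for the World, and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite them if you use this model).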

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

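✨ Key features

This is a multilingual translation model with multiple target languages (pob and por), selected by a sentence-initial language tag.
It reaches BLEU 49.6 on tatoeba-test-v2021-08-07 and BLEU 50.4 on flores101-devtest.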

💻 Usage examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a target-language tag (>>por<< or >>pob<<).
src_text = [
    ">>por<< Tom tried to stab me.",
    ">>por<< He has been to Hawaii several times."
]

# Hugging Face Hub model ID.
model_name = "Helsinki-NLP/opus-mt-tc-big-en-pt"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Expected output:
#     O Tom tentou esfaquear-me.
#     Ele já esteve no Havaí várias vezes.
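
For longer inputs it can help to feed the model a fixed number of sentences at a time instead of one large padded batch. The following is a minimal sketch of that idea, not part of the original model card: `chunks`, `translate_in_batches`, and the `translate_batch` callable are hypothetical names introduced here, and the callable is assumed to wrap the tokenizer and `model.generate` calls shown above.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order.

    `translate_batch` is a hypothetical callable that takes a list of
    tagged source sentences and returns their translations in order.
    """
    results: List[str] = []
    for batch in chunks(texts, batch_size):
        results.extend(translate_batch(batch))
    return results
```

With the objects from the basic example, `translate_batch` could be a small function that tokenizes the batch with `padding=True`, calls `model.generate`, and decodes each output with `skip_special_tokens=True`. The batch size of 8 is an arbitrary default, not a recommendation from the model authors.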

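Advanced usage

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-pt")
# The pipeline returns a list of dicts; the text is stored under "translation_text".
print(pipe(">>por<< Tom tried to stab me.")[0]["translation_text"])

# Expected output: O Tom tentou esfaquear-me.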

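📚 Detailed documentation

Model information

Property	Details
Release date	2022-03-13
Source language	eng
Target languages	pob por
Valid target language tags	>>pob<< >>por<<
Model type	transformer-big
Training data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original model	opusTCv20210807+bt_transformer-big_2022-03-13.zip
More information about released models	OPUS-MT eng-por README
More information about the model	MarianMT

This is a multilingual translation model with multiple target languages. Every input sentence must start with a language token of the form >>id<<, where id is a valid target language ID, for example >>pob<<.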
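
Because a missing or misspelled tag is easy to overlook, the tagging step can be wrapped in a small helper. This is an illustrative sketch only: `add_target_tag` and `VALID_TARGET_TAGS` are hypothetical names, and the helper relies solely on the two tags listed in the table above.

```python
# Tags taken from the "Valid target language tags" row of the model card.
VALID_TARGET_TAGS = ("pob", "por")


def add_target_tag(sentence: str, tag: str = "por") -> str:
    """Return `sentence` prefixed with a sentence-initial >>tag<< token."""
    if tag not in VALID_TARGET_TAGS:
        raise ValueError(
            f"unknown target tag {tag!r}; expected one of {VALID_TARGET_TAGS}"
        )
    text = sentence.strip()
    if text.startswith(">>"):
        # Avoid stacking two tags on the same sentence.
        raise ValueError("sentence already starts with a language tag")
    return f">>{tag}<< {text}"


tagged = [add_target_tag(s, "pob") for s in ["Good morning.", "See you tomorrow."]]
print(tagged)
```

The resulting strings can be passed to the tokenizer or the pipeline in the same way as the hand-written examples in the usage section.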

Benchmarks

Language pair	Test set	chr-F	BLEU	#sentences	#words
eng-por	tatoeba-test-v2021-08-07	0.69320	49.6	13222	105265
eng-por	flores101-devtest	0.71673	50.4	1012	26519

Test set translations: opusTCv20210807+bt_transformer-big_2022-03-13.test.txt
Test set scores: opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip
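
The chr-F column is a character n-gram F-score on a 0 to 1 scale. To give an intuition for what it measures, here is a simplified sentence-level sketch, assuming n-gram orders 1 to 6, whitespace removed, and beta = 2. `simple_chrf` is a hypothetical helper written for illustration; it is not the evaluation tool behind the table, so its values should not be expected to reproduce the reported scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order `n`, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta = 2, recall is weighted more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("O Tom tentou esfaquear-me.", "O Tom tentou esfaquear-me."))
```

An exact match scores 1.0 and two strings with no characters in common score 0.0; partial matches fall in between.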

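Acknowledgements

This work was supported by the following projects:

The [European Language Grid](https://www.european-language-grid.eu/), as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866).
The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069.

We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.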