opus-mt-tc-big-tr-en: a free, open-source model for Turkish-to-English translation

Opus Mt Tc Big Tr En

Developed by Helsinki-NLP

A large Transformer-based neural machine translation model for translating Turkish into English.

Machine translation

Transformers

Tags: #Turkish-English translation #high-accuracy machine translation #multi-domain

Downloads: 98.62k

Published: 4/13/2022

Model overview

The model is part of the OPUS-MT project and provides machine translation from Turkish into English.

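Model features

High-quality translation

Scores well on several benchmarks, most notably a BLEU score of 57.6 on the Tatoeba test set.

Multi-domain support

Handles text from several domains, including news and everyday conversation.

Open license

Released under the cc-by-4.0 license, which permits commercial and research use with attribution.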

Model capabilities

Text translation from Turkish into English

Handles several text types (news, conversation, and so on)

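Use cases

Content localization

News translation

Translating Turkish news into English

BLEU score of 30.7 on the newstest2018 test set

Education

Language-learning support

Helping learners understand Turkish content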

🚀 opus-mt-tc-big-tr-en

A neural machine translation model for translating from Turkish (tr) into English (en).

🚀 Quick start

Model description

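This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available for many of the world's languages. All models are originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and are then converted to PyTorch format with Hugging Face's transformers library. The training data comes from OPUS, and the training pipeline follows the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).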

Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/). Please cite these papers if you use this model.

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

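✨ Key features

Language pair: translates Turkish into English.
Training framework: trained with Marian NMT, an efficient C++ NMT implementation.
Broad applicability: usable for translation in a range of scenarios.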

📦 Installation

The source documentation gives no specific installation steps, so none are provided here.
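
As a hedged starting point, the usage examples in this document import `transformers`; the Marian tokenizer also relies on `sentencepiece`, and the examples request PyTorch tensors. The package list below is an assumption drawn from those examples, not something the source documentation states. The sketch only checks which of these packages are importable in the current environment.

```python
import importlib.util
from typing import Iterable, List

# Assumed requirements; the source documentation does not list any.
ASSUMED_PACKAGES = ("transformers", "sentencepiece", "torch")


def missing_packages(names: Iterable[str] = ASSUMED_PACKAGES) -> List[str]:
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Suggest an install command for whatever is absent.
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All assumed packages are importable.")
```

Pinning versions is left out on purpose: the only version the source mentions is the transformers release used for conversion, which is not necessarily a runtime requirement.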

💻 Usage examples

Basic usage

from transformers import MarianMTModel, MarianTokenizer

# Turkish source sentences to translate.
src_text = [
    "Allahsızlığı Yayma Kürsüsü başkanıydı.",
    "Tom'a ne olduğunu öğrenin."
]

# Model ID on the Hugging Face Hub.
model_name = "Helsinki-NLP/opus-mt-tc-big-tr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode each output sequence back to text.
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     He was the president of the Curse of Spreading Godlessness.
#     Find out what happened to Tom.
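
The basic example translates two sentences in a single batch. For longer inputs, one possible approach is to split the sentences into fixed-size batches so that each `generate` call stays small. The sketch below assumes the Hub ID `Helsinki-NLP/opus-mt-tc-big-tr-en`; the helper names `chunked` and `translate_all` and the default batch size of 8 are illustrative choices, not part of the model card.

```python
from typing import Iterator, List, Sequence

MODEL_NAME = "Helsinki-NLP/opus-mt-tc-big-tr-en"  # assumed Hub ID


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_all(sentences: Sequence[str], batch_size: int = 8) -> List[str]:
    """Translate Turkish sentences into English, one batch at a time."""
    # Imported here so that `chunked` stays usable without the model stack.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME)
    model.eval()

    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        # Pad within the batch and truncate over-long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():  # inference only, no gradients needed
            output_ids = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results
```

Calling `translate_all(src_text)` with the two sentences from the basic example should return the translations in input order. The first call downloads the model weights.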

Advanced usage

from transformers import pipeline

# Load the model from the Hugging Face Hub as a translation pipeline.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-tr-en")
print(pipe("Allahsızlığı Yayma Kürsüsü başkanıydı."))

# expected output (the pipeline returns a list of dicts):
#     [{'translation_text': 'He was the president of the Curse of Spreading Godlessness.'}]
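
A translation pipeline returns a list of dictionaries, one per input, with the text stored under the `translation_text` key. The sketch below shows one way to unpack that structure when translating several sentences at once. The helper names `extract_translations` and `translate_with_pipeline` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List


def extract_translations(results: List[Dict[str, str]]) -> List[str]:
    """Pull the translated strings out of a translation pipeline result."""
    return [item["translation_text"] for item in results]


def translate_with_pipeline(sentences: List[str]) -> List[str]:
    """Run the translation pipeline on a list of Turkish sentences."""
    # Imported here because building the pipeline loads the model weights.
    from transformers import pipeline

    pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-tr-en")
    return extract_translations(pipe(sentences))
```

Keeping the unpacking step in its own function makes it easy to check without downloading the model.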

📚 Detailed documentation

Model information

| Property | Details |
|---|---|
| Release date | 2022-03-17 |
| Source language | Turkish (tur) |
| Target language | English (eng) |
| Model type | transformer-big |
| Training data | opusTCv20210807+bt ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge)) |
| Tokenization | SentencePiece (spm32k,spm32k) |
| Original model | [opusTCv20210807+bt_transformer-big_2022-03-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/tur-eng/opusTCv20210807+bt_transformer-big_2022-03-17.zip) |
| More information | [OPUS-MT tur-eng README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/tur-eng/README.md) |

Benchmarks

| Language pair | Test set | chr-F | BLEU | Sentences | Words |
|---|---|---|---|---|---|
| tur-eng | tatoeba-test-v2021-08-07 | 0.71895 | 57.6 | 13907 | 109231 |
| tur-eng | flores101-devtest | 0.64152 | 37.6 | 1012 | 24721 |
| tur-eng | newsdev2016 | 0.58658 | 32.1 | 1001 | 21988 |
| tur-eng | newstest2016 | 0.56960 | 29.3 | 3000 | 66175 |
| tur-eng | newstest2017 | 0.57455 | 29.7 | 3007 | 67703 |
| tur-eng | newstest2018 | 0.58488 | 30.7 | 3000 | 68725 |

Test set translations: [opusTCv20210807+bt_transformer-big_2022-03-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tur-eng/opusTCv20210807+bt_transformer-big_2022-03-17.test.txt)
Test set scores: [opusTCv20210807+bt_transformer-big_2022-03-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/tur-eng/opusTCv20210807+bt_transformer-big_2022-03-17.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip
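
The chr-F column in the benchmark table is a character n-gram F-score on a 0 to 1 scale. To give a feel for what it measures, here is a simplified sketch that averages character n-gram precision and recall over orders 1 to 6 and combines them with a recall-weighted F-score (beta = 2). It is an illustration only: the function name `simple_chrf`, the whitespace handling, and the averaging scheme are assumptions, and the result is not guaranteed to match the scores in the table or the output of any evaluation tool.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score between 0.0 and 1.0."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An exact match scores 1.0 and texts with no characters in common score 0.0; partial matches fall in between.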

🔧 Technical details

Model conversion info

transformers version: 4.16.2
OPUS-MT git hash: 3405783
Conversion time: Wed Apr 13 20:02:48 EEST 2022
Conversion machine: LM0-400-22516.local

📄 License

This model is released under the CC-BY-4.0 license.

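Acknowledgements

This work is supported by the following projects:

[European Language Grid](https://www.european-language-grid.eu/), as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866).
[FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful to CSC (IT Center for Science, Finland) for the generous computational resources and IT infrastructure it provided.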