opus-mt-tc-big-cat_oci_spa-en: an open-source translation model for free, convenient translation from multiple languages into English

Opus Mt Tc Big Cat Oci Spa En

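Developed by Helsinki-NLP

A neural machine translation model for translating from Catalan, Occitan, and Spanish into English, part of the OPUS-MT project.

Machine Translation

Transformers

Supports multiple languages #MultilingualTranslation #HighBLEU #RomanceToEnglish

Downloads: 24

Published: 4/13/2022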

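Model overview

The model translates Catalan, Occitan, and Spanish text into English. It uses the transformer-big architecture and was trained on data from the OPUS corpus.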

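Model features

Multilingual support

Translates from Catalan, Occitan, and Spanish into English

High translation quality

Scores well on several test sets, for example BLEU 62.3 for Spanish-English on the Tatoeba test set

Open license

Released under CC-BY-4.0, which permits commercial and research use with attribution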

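Model capabilities

Catalan-to-English translation

Occitan-to-English translation

Spanish-to-English translation

Multilingual machine translation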

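Use cases

Text translation

Document translation

Translate Catalan, Occitan, or Spanish documents into English

High-quality output, with BLEU 62.3 on the Tatoeba test set for Spanish-English

Content localization

Help localize website or application content into English

Converts several Romance languages into English

Academic research

Machine translation research

Compare translation performance across language pairs

Provides baseline scores on several test sets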

🚀 opus-mt-tc-big-cat_oci_spa-en

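A neural machine translation model for translating from Catalan, Occitan, and Spanish (cat+oci+spa) into English (en). The model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and then converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipelines follow the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite them if you use this model)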

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

🚀 Quick start

The model translates Catalan, Occitan, and Spanish into English. Usage examples follow.

Basic usage

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "¿Puedo hacerte una pregunta?",
    "Toca algo de música."
]

# Hub model ID; matches the ID used in the pipeline example.
model_name = "Helsinki-NLP/opus-mt-tc-big-cat_oci_spa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Expected output:
#     Can I ask you a question?
#     He plays some music.

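Advanced usage

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-cat_oci_spa-en")
# The pipeline returns a list of dicts; the text is under "translation_text".
print(pipe("¿Puedo hacerte una pregunta?")[0]["translation_text"])

# Expected output: Can I ask you a question?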
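
For longer inputs it can help to pass the model fixed-size batches instead of one large list. The following is a minimal sketch and not part of the model card: `batched`, `translate_in_batches`, and `make_marian_translator` are hypothetical helper names, and the default batch size of 8 is an arbitrary assumption. The translator is passed in as a callable so the batching logic can be tried without downloading the model.

```python
from typing import Callable, Iterable, List


def batched(items: List[str], batch_size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        results.extend(translate_batch(batch))
    return results


def make_marian_translator(model_name: str) -> Callable[[List[str]], List[str]]:
    """Build a batch translator from a Marian checkpoint (downloads the model)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

To use it with this model, one would call `make_marian_translator("Helsinki-NLP/opus-mt-tc-big-cat_oci_spa-en")` once and pass the returned function to `translate_in_batches` together with the list of source sentences.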

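✨ Key features

Translates from Catalan, Occitan, and Spanish into English.
Built within the OPUS-MT project: trained with the Marian NMT framework and converted to PyTorch format.
Training data comes from OPUS, and training follows the OPUS-MT-train procedures.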

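📦 Installation

The model card gives no installation steps. The usage examples import the transformers library; see the OPUS-MT project documentation for setup details.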

📚 Documentation

Model information

Property	Details
Release date	2022-03-13
Source languages	Catalan (cat), Spanish (spa)
Target language	English (eng)
Model type	transformer-big
Training data	opusTCv20210807+bt ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
Tokenization	SentencePiece (spm32k,spm32k)
Original model	[opusTCv20210807+bt_transformer-big_2022-03-13.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/cat+oci+spa-eng/opusTCv20210807+bt_transformer-big_2022-03-13.zip)
More model information	[OPUS-MT cat+oci+spa-eng README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/cat+oci+spa-eng/README.md)

Benchmarks

Test set translations: [opusTCv20210807+bt_transformer-big_2022-03-13.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/cat+oci+spa-eng/opusTCv20210807+bt_transformer-big_2022-03-13.test.txt)
Test set scores: [opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/cat+oci+spa-eng/opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip
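
The original-model archive and the two test-set files linked above share one URL prefix and differ only in their suffix. The sketch below rebuilds the three links from the language-pair directory and the release name. `release_urls` is a hypothetical helper, and the pattern is inferred only from the links on this page, so other releases may not follow it.

```python
from typing import Dict

# Host and top-level directory taken from the links on this page.
BASE_URL = "https://object.pouta.csc.fi/Tatoeba-MT-models"


def release_urls(lang_pair: str, release: str) -> Dict[str, str]:
    """Return the model archive, test translations, and score file URLs."""
    prefix = f"{BASE_URL}/{lang_pair}/{release}"
    return {
        "model": f"{prefix}.zip",
        "translations": f"{prefix}.test.txt",
        "scores": f"{prefix}.eval.txt",
    }


urls = release_urls(
    "cat+oci+spa-eng",
    "opusTCv20210807+bt_transformer-big_2022-03-13",
)
for name, url in urls.items():
    print(f"{name}: {url}")
```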

Language pair	Test set	chr-F	BLEU	Sentences	Words
cat-eng	tatoeba-test-v2021-08-07	0.72019	57.3	1631	12627
spa-eng	tatoeba-test-v2021-08-07	0.76017	62.3	16583	138123
cat-eng	flores101-devtest	0.69572	45.4	1012	24721
oci-eng	flores101-devtest	0.63347	37.5	1012	24721
spa-eng	flores101-devtest	0.59696	29.9	1012	24721
spa-eng	newssyscomb2009	0.57104	30.8	502	11818
spa-eng	news-test2008	0.55440	27.9	2051	49380
spa-eng	newstest2009	0.57153	30.2	2525	65399
spa-eng	newstest2010	0.61890	36.8	2489	61711
spa-eng	newstest2011	0.60278	34.7	3003	74681
spa-eng	newstest2012	0.62760	38.6	3003	72812
spa-eng	newstest2013	0.60994	35.3	3000	64505
spa-eng	tico19-test	0.74033	51.8	2100	56315
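
To compare language pairs, it may be convenient to load the table into records. The sketch below copies the thirteen rows above into a string, parses them, picks the highest-BLEU test set per language pair, and ranks the pairs on one shared test set. The names `Score`, `parse_rows`, `best_by_pair`, and `rank_on_test_set` are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Rows copied from the table: pair, test set, chr-F, BLEU, sentences, words.
RAW_ROWS = """\
cat-eng tatoeba-test-v2021-08-07 0.72019 57.3 1631 12627
spa-eng tatoeba-test-v2021-08-07 0.76017 62.3 16583 138123
cat-eng flores101-devtest 0.69572 45.4 1012 24721
oci-eng flores101-devtest 0.63347 37.5 1012 24721
spa-eng flores101-devtest 0.59696 29.9 1012 24721
spa-eng newssyscomb2009 0.57104 30.8 502 11818
spa-eng news-test2008 0.55440 27.9 2051 49380
spa-eng newstest2009 0.57153 30.2 2525 65399
spa-eng newstest2010 0.61890 36.8 2489 61711
spa-eng newstest2011 0.60278 34.7 3003 74681
spa-eng newstest2012 0.62760 38.6 3003 72812
spa-eng newstest2013 0.60994 35.3 3000 64505
spa-eng tico19-test 0.74033 51.8 2100 56315
"""


@dataclass(frozen=True)
class Score:
    lang_pair: str
    test_set: str
    chrf: float   # reported on a 0-1 scale
    bleu: float   # reported on a 0-100 scale
    sentences: int
    words: int


def parse_rows(raw: str) -> List[Score]:
    """Parse whitespace-separated rows into Score records."""
    scores = []
    for line in raw.strip().splitlines():
        pair, test_set, chrf, bleu, sents, words = line.split()
        scores.append(
            Score(pair, test_set, float(chrf), float(bleu), int(sents), int(words))
        )
    return scores


def best_by_pair(scores: List[Score]) -> Dict[str, Score]:
    """Keep, for each language pair, the row with the highest BLEU."""
    best: Dict[str, Score] = {}
    for s in scores:
        if s.lang_pair not in best or s.bleu > best[s.lang_pair].bleu:
            best[s.lang_pair] = s
    return best


def rank_on_test_set(scores: List[Score], test_set: str) -> List[Score]:
    """Rows for one test set, sorted from highest to lowest BLEU."""
    rows = [s for s in scores if s.test_set == test_set]
    return sorted(rows, key=lambda s: s.bleu, reverse=True)


SCORES = parse_rows(RAW_ROWS)
BEST = best_by_pair(SCORES)
for pair, s in sorted(BEST.items()):
    print(f"{pair}: BLEU {s.bleu} on {s.test_set}")
```

Ranking on `flores101-devtest` is the fairer comparison across source languages, because all three pairs are scored on the same 1012 sentences there, whereas the Tatoeba test sets differ in size per pair.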

🔧 Technical details

The model card gives no implementation details beyond the model information listed above.

📄 License

This model is released under the CC-BY-4.0 license.

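Acknowledgements

This work is supported by the following projects:

[Pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866) of the [European Language Grid](https://www.european-language-grid.eu/).
The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful for the generous computational resources and IT infrastructure provided by CSC, the IT Center for Science in Finland.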

Model conversion information

Property	Details
transformers version	4.16.2
OPUS-MT git hash	3405783
Conversion time	Wed Apr 13 18:30:38 EEST 2022
Conversion machine	LM0-400-22516.local
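
The table records that the checkpoint was converted with transformers 4.16.2. The card does not state a minimum required version, so the check below rests on an assumption: that running with a version at least as new as the conversion version is a reasonable default. `version_tuple`, `is_at_least`, and `installed_transformers_version` are hypothetical helpers.

```python
from typing import Optional, Tuple

# transformers version listed in the conversion table above.
CONVERSION_VERSION = "4.16.2"


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.16.2' into (4, 16, 2); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # e.g. the 'dev0' in '4.17.0.dev0'
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str = CONVERSION_VERSION) -> bool:
    """Compare two version strings by their numeric components."""
    return version_tuple(installed) >= version_tuple(required)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is missing."""
    from importlib.metadata import PackageNotFoundError, version

    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif is_at_least(found):
    print(f"transformers {found} is at least {CONVERSION_VERSION}")
else:
    print(f"transformers {found} is older than {CONVERSION_VERSION}")
```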