opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa: an open-source model for translating Romance languages into multiple target languages

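Developed by Helsinki-NLP

A multi-target neural machine translation model that translates from a range of Romance languages into German, English, French, Portuguese, and Spanish.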

Machine translation

Transformers

Multilingual · License: Apache-2.0 · #multilingual-bible-translation #romance-languages #bleu-55.6

Published: 10/8/2024

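Model overview

The model is part of the OPUS-MT project, which aims to make neural machine translation models widely available for many of the world's languages. It translates from Romance languages, including Antillean Creole and Aragonese, into German, English, French, Portuguese, and Spanish.

Model features

Multiple target languages

A single model translates from many Romance languages into German, English, French, Portuguese, or Spanish; the target is selected with a language token at the start of each sentence.

Benchmark performance

On the tatoeba-test-v2020-07-28-v2023-09-26 test set, the model scores BLEU 55.6 and chr-F 0.73367.

Broad language coverage

Covers 42 source language codes and 5 target languages, spanning Romance languages and creole languages.

Model capabilities

Text translation

Multilingual support

Neural machine translation

Use cases

Language translation

Multilingual document translation

Translate documents written in Romance languages into German, English, French, Portuguese, or Spanish.

Intended for commercial, educational, and research use; the model is released under Apache-2.0.

Cross-language communication

Helps users communicate across the supported languages.

Translation of short messages to support more efficient communication.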

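🚀 opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa

A neural machine translation model for translating from Romance languages into German, English, French, Portuguese, and Spanish. It is part of the OPUS-MT project, was trained with the Marian NMT framework, and was converted to a PyTorch model with Hugging Face's transformers library.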

🚀 Quick start

Code example

from transformers import MarianMTModel, MarianTokenizer

# Each sentence starts with a >>id<< token that selects the target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Same Hub ID as in the pipeline example (Helsinki-NLP organization).
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also run OPUS-MT models through the transformers pipeline API, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
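
For longer inputs, the quick-start steps can be wrapped in a small helper that translates in fixed-size batches. The sketch below is one possible arrangement, not part of the model card: `chunks` and `translate_in_batches` are hypothetical names, the default batch size of 8 is an arbitrary assumption, and the tokenizer and model are passed in as arguments so the helper can reuse the objects loaded in the code example.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(tokenizer, model, sentences: Sequence[str],
                         batch_size: int = 8) -> List[str]:
    """Translate tagged sentences batch by batch (hypothetical helper).

    `tokenizer` and `model` are expected to behave like the MarianTokenizer
    and MarianMTModel objects from the code example. Every sentence must
    already start with a target token such as >>deu<<.
    """
    outputs: List[str] = []
    for batch in chunks(sentences, batch_size):
        # Pad within the batch so all sequences have equal length.
        encoded = tokenizer(list(batch), return_tensors="pt", padding=True)
        generated = model.generate(**encoded)
        outputs.extend(
            tokenizer.decode(ids, skip_special_tokens=True) for ids in generated
        )
    return outputs
```

With the objects from the code example, `translate_in_batches(tokenizer, model, src_text)` returns the decoded translations in input order.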

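✨ Key features

Translates from many Romance languages into German, English, French, Portuguese, and Spanish.
Part of the OPUS-MT project, which works to make neural machine translation models widely available.
Trained with Marian NMT, an efficient NMT framework written in pure C++.
Converted to a PyTorch model with the transformers library for easy use.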

📦 Installation

The model card does not give installation steps; install the transformers library by following its official documentation.
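
As a rough guide, the code examples in this card import `transformers` and request PyTorch tensors, and the Marian tokenizer depends on the `sentencepiece` package. The sketch below checks whether those packages are importable before the model is loaded. The helper name `missing_packages` and the package list are assumptions for illustration, not requirements stated by the model card.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed dependency list for the examples in this card (not from the model card).
REQUIRED = ("transformers", "torch", "sentencepiece")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All assumed packages are available.")
```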

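📚 Detailed documentation

Model details

A neural machine translation model for translating from Romance languages (roa) into German, English, French, Portuguese, and Spanish.

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and are then converted to PyTorch with Hugging Face's transformers library. Training data comes from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model description: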

Property	Details
Developed by	Language Technology Research Group at the University of Helsinki
Model type	Translation (transformer-big)
Release date	2024-05-30
License	Apache-2.0
Source languages	acf, arg, ast, cat, cbk, cos, crs, egl, ext, fra, frm, fro, frp, fur, gcf, glg, hat, ita, kea, lad, lij, lld, lmo, lou, mfe, mol, mwl, nap, oci, osp, pap, pcd, pms, por, roh, ron, rup, scn, spa, srd, vec, wln
Target languages	deu, eng, fra, por, spa
Valid target language labels	>>deu<<, >>eng<<, >>fra<<, >>por<<, >>spa<<, >>xxx<<
Original model	[opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
Resources for more information	[OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30), [OPUS-MT-train GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train), more information about MarianNMT models in the transformers library, [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/), [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1), [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)

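This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, for example >>deu<<.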
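
A small helper can prepend and validate the token so that a missing or mistyped tag is caught before translation. This is a sketch under assumptions: `add_target_tag` is a hypothetical name, and the accepted IDs are copied from the target language list in this card.

```python
import re

# Target language IDs listed in this card.
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}

# Matches an existing >>id<< token at the start of a sentence.
_TAG = re.compile(r"^\s*>>([a-z]{3})<<\s*")


def add_target_tag(text: str, target: str) -> str:
    """Return `text` prefixed with the >>target<< token.

    An existing leading token is replaced, so the function is idempotent.
    """
    if target not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target!r}")
    body = _TAG.sub("", text, count=1).strip()
    return f">>{target}<< {body}"


print(add_target_tag("Bonjour tout le monde.", "deu"))  # >>deu<< Bonjour tout le monde.
print(add_target_tag(">>eng<< Bom dia.", "spa"))        # >>spa<< Bom dia.
```

Rejecting unknown IDs up front is a design choice of this sketch; it gives an explicit error rather than passing an unsupported tag to the model.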

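Uses

The model can be used for translation and text-to-text generation.

Risks, limitations, and biases

⚠️ Important

Readers should be aware that the model was trained on various public datasets that may contain disturbing or offensive content and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues in language models (see, for example, [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).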

Training

Data: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
Preprocessing: SentencePiece (spm32k, spm32k)
Model type: transformer-big
Original MarianNMT model: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip)
Training scripts: [GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train)

Evaluation

[Model scores on the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-deu%2Beng%2Bfra%2Bpor%2Bspa/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-05-30)
Test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt)
Test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-deu+eng+fra+por+spa/opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.73367	55.6	10000	83852
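
To make the chr-F column more concrete: chr-F is an F-score over character n-grams shared between a system translation and a reference. The sketch below is a simplified sentence-level illustration, assuming n-gram orders 1 to 6, beta = 2 (recall weighted more than precision), and whitespace removed before counting. `simple_chrf` is a hypothetical name, and this is not the scoring code that produced the table, so its values should not be expected to match published scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order `n`, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in the range 0 to 1."""
    scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        matches = sum((hyp & ref).values())  # clipped n-gram overlap
        precision = matches / sum(hyp.values())
        recall = matches / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall
                      / (beta**2 * precision + recall))
    # Average the per-order F-scores.
    return sum(scores) / len(scores) if scores else 0.0


print(simple_chrf("Guten Morgen", "Guten Morgen"))
print(simple_chrf("Guten Morgen", "Guten Abend"))
```

An identical hypothesis and reference score 1.0, and texts with no characters in common score 0.0; partial overlaps fall in between.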

Citation information

Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/), and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite these if you use this model).

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

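Acknowledgements

This work is supported by the [HPLT project](https://hplt-project.org/), funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).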