opus-mt-tc-bible-big-roa-en: a free, open-source model for translating Romance languages into English

Developed by Helsinki-NLP

A neural machine translation model for translating from Romance languages (roa) into English (en), released as part of the OPUS-MT project.

Machine translation

Transformers

Supports multiple languages. License: Apache-2.0. Tags: #Romance-language-translation #Bible-corpus-training #multilingual

Published: 10/8/2024

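Model overview

This model translates multiple Romance languages into English. It is built on the Transformer architecture and is intended for text translation tasks.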

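Model features

Multilingual support

Translates from multiple Romance languages into English.

Translation quality

Trained on OPUS data. On the multi-eng Tatoeba test set the model scores BLEU 62.8 and chr-F 0.76737.

Easy integration

Can be loaded into applications through the Hugging Face Transformers library.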

Model capabilities

Text translation

Multilingual processing

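Use cases

Language translation

Document translation

Translate documents written in Romance languages into English.

Result: high-quality English translations.

Real-time translation

Translation for live chat or meetings.

Result: fast, accurate translation responses.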

🚀 opus-mt-tc-bible-big-roa-en

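This is a neural machine translation model for translating from Romance languages (roa) into English (en). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages.

🚀 Quick start

A short example: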

from transformers import MarianMTModel, MarianTokenizer

# Source sentences: the first is Portuguese, the second Spanish
src_text = [
    "É caro demais.",
    "Estamos muertos."
]

# Hugging Face Hub ID, the same one used in the pipeline example
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode each generated sequence back to text
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     It's too expensive.
#     We're dead.

You can also use OPUS-MT models through the transformers pipeline API, for example:

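from transformers import pipeline

# Build a translation pipeline from the Hub model
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")
# The pipeline returns a list of dicts; print only the translated string
print(pipe("É caro demais.")[0]["translation_text"])

# expected output: It's too expensive.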
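
For longer inputs such as whole documents, it can help to feed the model a few sentences at a time rather than one large batch. The following is a minimal sketch of that idea and is not part of the original model card: `batched` and `translate_all` are hypothetical helper names, the default batch size of 8 is an arbitrary assumption, and `translate_batch` stands for any function that maps a list of source sentences to a list of translations.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch
        yield batch


def translate_all(
    sentences: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune for your hardware
) -> List[str]:
    """Translate sentences batch by batch, preserving input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("translate_batch must return one output per input")
        results.extend(outputs)
    return results
```

With the tokenizer and model loaded as in the first example, `translate_batch` could tokenize the batch with `padding=True`, call `model.generate`, and decode the result with `tokenizer.batch_decode(..., skip_special_tokens=True)`.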

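✨ Key features

Translates from multiple Romance languages into English.
Part of the OPUS-MT project: trained with the Marian NMT framework, then converted to a PyTorch model.
Training data comes from OPUS, and training follows the OPUS-MT-train procedures.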

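📚 Detailed documentation

Model details

A neural machine translation model for translating from Romance languages (roa) into English (en).

This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with the [Marian NMT](https://marian-nmt.github.io/) framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).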

Attribute	Details
Developed by	Language Technology Research Group at the University of Helsinki
Model type	Translation (transformer-big)
Release date	2024-08-17
License	Apache-2.0
Source languages	acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa srd vec wln
Target language	eng
Original model	[opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
More information	[OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17); [OPUS-MT-train GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train); more information about MarianNMT models in the transformers library; [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/); [HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset)](https://hplt-project.org/datasets/v1); [A massively parallel Bible corpus](https://aclanthology.org/L14-1215/)
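
When input text arrives with a language label, a quick lookup against the source-language list can catch unsupported inputs before translation. This is a minimal sketch, assuming the labels use the same three-letter codes as the table; `SOURCE_LANGUAGES`, `TARGET_LANGUAGE`, and `is_supported_source` are hypothetical names and not part of the model or the transformers library.

```python
# Source-language codes copied from the model details table
SOURCE_LANGUAGES = frozenset(
    "acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad "
    "lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa "
    "srd vec wln".split()
)
TARGET_LANGUAGE = "eng"


def is_supported_source(code: str) -> bool:
    """Return True if `code` is one of the listed source languages."""
    return code.strip().lower() in SOURCE_LANGUAGES


print(is_supported_source("por"))  # True
print(is_supported_source("deu"))  # False
```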

Uses

This model can be used for translation and text-to-text generation.

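Risks, limitations and biases

⚠️ Important note

Readers should be aware that the model is trained on various public data sets that may contain disturbing or offensive content and that can propagate historical and current stereotypes.

A large body of research has explored bias and fairness issues in language models (see, for example, [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).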

Training

Data: opusTCv20230926max50+bt+jhubc ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
Pre-processing: SentencePiece (spm32k,spm32k)
Model type: transformer-big
Original MarianNMT model: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip)
Training scripts: [GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train)

Evaluation

[Model scores on the OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/index.php?pkg=opusmt&test=all&scoreslang=all&chart=standard&model=Tatoeba-MT-models/roa-eng/opusTCv20230926max50%2Bbt%2Bjhubc_transformer-big_2024-08-17)
Test set translations: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt)
Test set scores: [opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/roa-eng/opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt)
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
multi-eng	tatoeba-test-v2020-07-28-v2023-09-26	0.76737	62.8	10000	87576
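
The chr-F column is a character n-gram F-score. As a rough illustration of what that measures, the sketch below averages character n-gram precision and recall over orders 1 to 6 and combines them with beta = 2. Those settings and the function names are assumptions made for illustration; this is not the evaluation code behind the table, and it should not be expected to reproduce the reported scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        # Counter intersection keeps the minimum count of each shared n-gram
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)


print(chrf_sketch("We're dead.", "We're dead."))  # 1.0
```

An identical hypothesis and reference score 1.0, strings with no characters in common score 0.0, and partial matches fall in between.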

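Citation information

Publications: [Democratizing neural machine translation with OPUS-MT](https://doi.org/10.1007/s10579-023-09704-w), [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/). Please cite them if you use this model.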

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

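Acknowledgements

This work was supported by the [HPLT project](https://hplt-project.org/), funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC – IT Center for Science, Finland, and the [EuroHPC supercomputer LUMI](https://www.lumi-supercomputer.eu/).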