opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul open-source model: multilingual translation into more than 100 languages

Opus Mt Tc Bible Big Deu Eng Fra Por Spa Mul

Developed by Helsinki-NLP

A Transformer-based translation model covering more than 100 target languages

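Large language model

Transformers

Supports multiple languages
License: Apache-2.0
#MultilingualTranslation #LowResourceLanguageProcessing #CrossLingualUnderstanding

Downloads: 203

Published: 10/9/2024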

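Model overview

This Transformer-based model targets a wide range of low-resource languages and is aimed in particular at minority languages of Africa, Asia, and the Americas.

Model features

Broad language coverage

Supports more than 100 languages, with particular attention to low-resource and minority languages

Multi-target translation

A single model translates into many target languages

Low-resource focus

Specifically targets languages with scarce training data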

Model capabilities

Language translation

Text-to-text generation

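Use cases

Language preservation

Digitization of minority languages

Helps digitize and preserve endangered languages

Provides a research tool for linguists

Commercial applications

Multilingual customer service systems

Supports automated customer service in less widely spoken languages

Extends service coverage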

🚀 opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul

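This is a neural machine translation model for translating from German, English, French, Portuguese, and Spanish into many target languages. It supports multilingual text translation and text generation, but because training data is limited for some languages, translation quality may vary considerably.

🚀 Quick start

The following short example shows how to translate with this model:

Basic usage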

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with a >>id<< token that selects the target language.
src_text = [
    ">>aai<< Replace this with text in an accepted source language.",
    ">>zza<< This is the second sentence."
]

# Repository ID matches the one used in the pipeline example below.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models through the transformers pipeline API:

Advanced usage

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul")
print(pipe(">>aai<< Replace this with text in an accepted source language."))

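✨ Key features

Multilingual support: translates from German, English, French, Portuguese, and Spanish into many target languages.
Open-source framework: trained with the Marian NMT framework and converted to PyTorch using the transformers library, so developers can use it directly.

📦 Installation

The model card does not give installation steps; follow the official installation instructions for the dependencies (such as transformers).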
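As a rough pre-flight check before running the examples, the sketch below reports which packages cannot be found in the current environment. The package list is an assumption rather than something the model card states: transformers and torch are used by the quick-start code, and sentencepiece is what MarianTokenizer relies on for its SentencePiece vocabulary. The helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec

# Assumed dependency list (not taken from the model card).
REQUIRED = ["transformers", "torch", "sentencepiece"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed dependencies are importable.")
```

The check only looks up each package without importing it, so it stays fast even when torch is installed.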

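📚 Detailed documentation

Model details

This is a neural machine translation model for translating from German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa) into multiple languages (mul). Note that many of the listed languages are poorly supported because training data is very limited for most of them. Translation performance varies widely, and for a large number of language pairs the model may not work at all.

The model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using Hugging Face's transformers library. Training data comes from OPUS, and the training pipelines use the procedures of OPUS-MT-train.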

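Property	Details
Developer	Language Technology Research Group at the University of Helsinki
Model type	Translation (transformer-big)
Release date	2024-05-30
License	Apache-2.0
Source languages	German, English, French, Portuguese, Spanish
Target languages	Many languages (listed in full in the model card)
Original model	opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Further resources	OPUS-MT dashboard; OPUS-MT-train GitHub repository; more information about MarianNMT models in the transformers library; Tatoeba Translation Challenge; HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset); a massively parallel Bible corpus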

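This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<< (where id is a valid target language ID), for example >>aai<<.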
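The token requirement can be sketched as a small helper that prefixes each sentence before it is passed to the tokenizer or pipeline. The function names with_target_token and fan_out are hypothetical. The ID pattern is an assumption about what target IDs look like (a short lowercase code, optionally with a script suffix); it checks the shape only and says nothing about whether the model actually supports the language.

```python
import re

# Assumption: target IDs are short lowercase codes such as "aai", optionally
# followed by a script or region suffix (e.g. "xxx_Latn"). This only checks
# the shape of the ID, not whether the model supports it.
_LANG_ID = re.compile(r"^[a-z]{2,3}(_[A-Za-z]{2,4})?$")


def with_target_token(text, lang_id):
    """Prefix `text` with the >>id<< token that selects the target language."""
    if not _LANG_ID.match(lang_id):
        raise ValueError(f"unexpected language ID format: {lang_id!r}")
    return f">>{lang_id}<< {text.strip()}"


def fan_out(text, lang_ids):
    """Build one model input per requested target language."""
    return [with_target_token(text, lang_id) for lang_id in lang_ids]


batch = fan_out("This is a test.", ["aai", "zza"])
print(batch)  # ['>>aai<< This is a test.', '>>zza<< This is a test.']
```

The resulting list has the same shape as src_text in the basic usage example, so it can be tokenized as a single padded batch.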

Uses

The model can be used for translation and text-to-text generation.

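Risks, limitations, and biases

⚠️ Important

Readers should be aware that the model is trained on various public datasets that may contain disturbing or offensive content and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues in language models (see, for example, Sheng et al. (2021) and Bender et al. (2021)).

In addition, many of the listed languages are poorly supported because training data is very limited for most of them. Translation performance varies widely, and for a large number of language pairs the model may not work at all.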
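Given that caveat, one cheap sanity check is to confirm that a target ID at least exists as a token in the model vocabulary. The sketch below assumes language tokens are stored as ordinary vocabulary entries of the form >>id<<; the helper names are hypothetical and the vocabulary shown is a toy stand-in. A token being present says nothing about translation quality for that language.

```python
def language_tokens(vocab):
    """Collect every vocabulary entry shaped like a >>id<< language token."""
    return sorted(
        tok for tok in vocab if tok.startswith(">>") and tok.endswith("<<")
    )


def is_known_target(lang_id, vocab):
    """Return True if the >>lang_id<< token appears in the given vocabulary."""
    return f">>{lang_id}<<" in language_tokens(vocab)


# Toy vocabulary for illustration only. With the real model, one could pass
# tokenizer.get_vocab() from a loaded MarianTokenizer instead.
toy_vocab = {">>aai<<": 0, ">>zza<<": 1, "\u2581hello": 2, "</s>": 3}

print(language_tokens(toy_vocab))         # ['>>aai<<', '>>zza<<']
print(is_known_target("aai", toy_vocab))  # True
print(is_known_target("xyz", toy_vocab))  # False
```

For languages that pass this check, spot-checking a few translations by hand or against the published benchmark results remains the more meaningful test.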

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Preprocessing: SentencePiece (spm32k,spm32k)
Model type: transformer-big
Original MarianNMT model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training scripts: GitHub repository

Evaluation

Model scores on the OPUS-MT dashboard
Test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
Test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.55024	29.2	10000	75838
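The chr-F column is a character n-gram F-score. To give a feel for what it measures, here is a simplified sentence-level sketch on the same 0 to 1 scale as the table. It is not the scorer that produced the reported numbers: the function names are hypothetical, and the settings (n-grams up to length 6, beta = 2, whitespace ignored) are commonly used defaults that are assumed here rather than taken from the model card.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: F-beta over averaged n-gram P and R."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this length
        overlap = sum((hyp & ref).values())  # clipped matching n-gram count
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf("the cat sat", "the cat sat"))  # 1.0
print(chrf("abc", "xyz"))                  # 0.0
print(round(chrf("the cat sat on the mat", "the cat is on the mat"), 3))
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that drops reference content is penalized more than one that adds extra characters.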

Citation information

Publications: Democratizing neural machine translation with OPUS-MT, OPUS-MT – Building open translation services for the World, and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite these if you use this model).

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}