opus-mt-tc-big-zh-ja: an open-source model for Chinese-to-Japanese translation

Opus Mt Tc Big Zh Ja

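Developed by Helsinki-NLP

A neural machine translation model developed at the University of Helsinki for translating Chinese into Japanese.

Machine translation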

Transformers

Tags: #Chinese-Japanese translation #neural machine translation

Downloads: 190

Published: 8/12/2022

Model overview

The model is part of the OPUS-MT project. It is trained with the transformer-big architecture and translates text from Chinese into Japanese.

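Model features

Translation quality

Scores 24.6 BLEU (chr-F 0.27790) on the tatoeba-test-v2021-08-07 test set.

Language pair

Built specifically for Chinese-to-Japanese translation. The model is one-directional: it does not translate Japanese into Chinese.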

Open license

Released under the CC-BY-4.0 license, which permits commercial and research use provided attribution is given.

Model capabilities

Text translation

Cross-lingual text generation

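Use cases

Content translation

Social media content

Translate Chinese social media posts into Japanese to support cross-language communication.

Goal: preserve the meaning of the source text

Business documents

Produce Japanese versions of Chinese business documents.

Goal: render specialist terminology accurately

Education

Language learning aid

Give students of Chinese or Japanese a Japanese rendering of Chinese text to study alongside the original.

Goal: provide a translation that can serve as a reference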

🚀 opus-mt-tc-big-zh-ja

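A neural machine translation model for translating from Chinese (zh) to Japanese (ja). It belongs to the OPUS-MT project, an effort to make neural machine translation models widely available for many of the world's languages.

🚀 Quick start

Example code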

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "生日快樂，Muiriel！",
    "好凍。"
]

model_name = "Helsinki-NLP/opus-mt-tc-big-zh-ja"  # Hub ID, same as in the pipeline example
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     誕生日おめでとう、Muiriel!
#     寒い。

You can also use OPUS-MT models through the transformers pipeline API, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-zh-ja")
print(pipe("生日快樂，Muiriel！"))

# expected output: 誕生日おめでとう、Muiriel!
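
The snippets above translate one or two sentences in a single call. For longer inputs it usually helps to feed the model in fixed-size batches so padding and memory use stay bounded. The following is a minimal sketch of that idea, not part of the original model card: the helper names `batched`, `translate_all`, and `build_marian_translator` are hypothetical, and the generation settings (`num_beams=4`, `max_new_tokens=128`) and batch size are illustrative assumptions rather than recommended values.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_all(
    sentences: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results


def build_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-zh-ja",
) -> Callable[[List[str]], List[str]]:
    """Return a function that translates a list of sentences with a Marian model.

    transformers is imported lazily so the helpers above stay usable without it.
    The generation settings below are illustrative assumptions.
    """
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, num_beams=4, max_new_tokens=128)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate
```

With the model downloaded, `translate_all(src_text, build_marian_translator(), batch_size=16)` would return one Japanese string per input sentence, in the same order as the input.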

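✨ Key features

The model can be used for translation and text-to-text generation.

📚 Detailed documentation

Model details

A neural machine translation model for translating from Chinese (zh) to Japanese (ja).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model description: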

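Property	Details
Developed by	Language Technology Research Group at the University of Helsinki
Model type	Translation (transformer-big)
Release date	2022-07-28
License	CC-BY-4.0
Source language	Chinese (zho)
Target language	Japanese (jpn)
Language pair	Chinese-Japanese (zho-jpn)
Original model	opusTCv20210807-sepvoc_transformer-big_2022-07-28.zip
Resources for more information	OPUS-MT-train GitHub repo; more information about released models for this language pair: OPUS-MT zho-jpn README; more information about MarianNMT models in the transformers library; Tatoeba Translation Challenge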

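Risks, limitations, and biases

⚠️ Important note

Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

A significant body of research has explored bias and fairness issues in language models (see, for example, Sheng et al. (2021) and Bender et al. (2021)).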

Training

Data: opusTCv20210807 (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model type: transformer-big
Original MarianNMT model: opusTCv20210807-sepvoc_transformer-big_2022-07-28.zip
Training scripts: GitHub repo

Evaluation

Test set translations: opusTCv20210807-sepvoc_transformer-big_2022-07-28.test.txt
Test set scores: opusTCv20210807-sepvoc_transformer-big_2022-07-28.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	Sentences	Words
Chinese-Japanese	tatoeba-test-v2021-08-07	0.27790	24.6	2497	21956
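
To give a feel for what the chr-F column measures, here is a simplified, sentence-level character n-gram F-score in plain Python. It is only a sketch under stated assumptions: the function names are hypothetical, whitespace is dropped before counting, n-gram orders 1 to 6 are averaged, and beta is set to 2. It will not reproduce the corpus-level figures in the table, which come from the authors' own evaluation files.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (an assumption of this sketch)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-1 scale.

    Averages character n-gram precision and recall over n = 1..max_n,
    then combines them into an F-beta score (beta > 1 weights recall higher).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("寒い。", "寒い。"))  # identical strings → 1.0
```

A hypothesis that shares no characters with the reference scores 0.0, and partial overlaps fall in between.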

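Citation information

Publications: "OPUS-MT – Building open translation services for the World" and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT" (please cite them if you use this model).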

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

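Acknowledgements

This work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by the MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC - IT Center for Science, Finland.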