opus-mt-tc-big-de-es open-source translation model: free, accurate German-to-Spanish translation

Opus Mt Tc Big De Es

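Developed by Helsinki-NLP

A German-to-Spanish neural machine translation model developed by the Language Technology Research Group at the University of Helsinki as part of the OPUS-MT project.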

Machine translation

Transformers

Tags: #German-Spanish translation #high-accuracy machine translation #multi-domain

Downloads: 33

Published: 8/12/2022

Model overview

This model is built specifically for German-to-Spanish text translation and is trained with the transformer-big architecture.

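Model features

High-quality translation

BLEU scores across the reported test sets range from 24.9 (flores101-devtest) to 50.8 (tatoeba-test-v2021-08-07).

Trained on multiple datasets

Trained on opusTCv20210807, a collection of public parallel data from OPUS that covers a wide range of domains.

Open license

Released under the CC-BY-4.0 license, which permits both commercial and research use.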

Model capabilities

German-to-Spanish text translation

Batch text processing

Translation of text from a variety of domains

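Use cases

Content translation

News translation

Translate German news articles into Spanish.

Scores 33.8 BLEU on the newstest2010 test set.

Social media content translation

Translate social media posts and comments.

Scores 50.8 BLEU on the tatoeba-test-v2021-08-07 test set, which consists of short user-contributed sentences rather than social media posts.

Education

Learning aid

Help language learners understand German content.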

🚀 opus-mt-tc-big-de-es

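This is a neural machine translation model for translating from German (de) to Spanish (es). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages.

🚀 Quick start

Code example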

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "Ich verstehe nicht, worüber ihr redet.",
    "Die Vögel singen in den Bäumen."
]

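# Model ID on the Hugging Face Hub
model_name = "Helsinki-NLP/opus-mt-tc-big-de-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# Tokenize the batch (padded to equal length) and generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode each generated token sequence back into text
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))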

# expected output:
#     No entiendo de qué están hablando.
#     Los pájaros cantan en los árboles.

You can also use OPUS-MT models through the transformers pipeline API, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-de-es")
print(pipe("Ich verstehe nicht, worüber ihr redet."))

# expected output: No entiendo de qué están hablando.
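
For longer lists of sentences, it can help to translate in fixed-size batches so that each call to `generate` handles a bounded amount of padded input. The following is a minimal sketch, not part of the original model card: the helper names `batched`, `translate_in_batches`, and `make_marian_translator`, the default batch size of 8, and the use of `truncation=True` are all illustrative assumptions.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to the available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        results.extend(translate_fn(batch))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-de-es",
) -> Callable[[List[str]], List[str]]:
    """Build a translate function backed by the Marian model (hypothetical helper)."""
    # Imported lazily so the batching helpers above work without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(batch: List[str]) -> List[str]:
        # Pad to a common length; truncate inputs that exceed the model limit.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate
```

A call such as `translate_in_batches(src_text, make_marian_translator(), batch_size=4)` would download the model on first use. Because the translation function is passed in as an argument, the batching logic can be checked with a stand-in function before loading the real model.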

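✨ Main features

- A neural machine translation model for translating from German (de) to Spanish (es).
- Part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available.
- Originally trained with the [Marian NMT](https://marian-nmt.github.io/) framework, then converted to PyTorch using the Hugging Face transformers library.
- Trained on data from OPUS, following the training pipeline of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).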

📚 Detailed documentation

Model details

- Developed by: Language Technology Research Group at the University of Helsinki
- Model type: translation (transformer-big)
- Release date: 2022-07-26
- License: CC-BY-4.0
- Languages:
  - Source language: deu
  - Target language: spa
  - Language pair: deu-spa
  - Valid target language labels: none
- Original model: [opusTCv20210807_transformer-big_2022-07-26.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/deu-spa/opusTCv20210807_transformer-big_2022-07-26.zip)
- Resources for more information:
  - [OPUS-MT-train GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train)
  - More information about released models for this language pair: [OPUS-MT deu-spa README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/deu-spa/README.md)
  - More information about MarianNMT models in the transformers library
  - [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)

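Uses

This model can be used for translation and text-to-text generation.

Risks, limitations and biases

⚠️ Important note

Readers should be aware that the model is trained on various public datasets that may contain disturbing or offensive content and that can propagate historical and current stereotypes.

A substantial body of research has explored bias and fairness issues in language models (see, for example, [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).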

Training

- Data: opusTCv20210807 ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
- Pre-processing: SentencePiece (spm32k,spm32k)
- Model type: transformer-big
- Original MarianNMT model: [opusTCv20210807_transformer-big_2022-07-26.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/deu-spa/opusTCv20210807_transformer-big_2022-07-26.zip)
- Training scripts: [GitHub repository](https://github.com/Helsinki-NLP/OPUS-MT-train)

Evaluation

- Test set translations: [opusTCv20210807_transformer-big_2022-07-26.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/deu-spa/opusTCv20210807_transformer-big_2022-07-26.test.txt)
- Test set scores: [opusTCv20210807_transformer-big_2022-07-26.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/deu-spa/opusTCv20210807_transformer-big_2022-07-26.eval.txt)
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip

| Property | Details |
|---|---|
| Model type | Translation (transformer-big) |
| Training data | opusTCv20210807 |

| Language pair | Test set | chr-F | BLEU | Sentences | Words |
|---|---|---|---|---|---|
| deu-spa | tatoeba-test-v2021-08-07 | 0.69105 | 50.8 | 10521 | 82570 |
| deu-spa | flores101-devtest | 0.53208 | 24.9 | 1012 | 29199 |
| deu-spa | newssyscomb2009 | 0.55547 | 28.3 | 502 | 12503 |
| deu-spa | news-test2008 | 0.54400 | 26.6 | 2051 | 52586 |
| deu-spa | newstest2009 | 0.53934 | 25.9 | 2525 | 68111 |
| deu-spa | newstest2010 | 0.60102 | 33.8 | 2489 | 65480 |
| deu-spa | newstest2011 | 0.57133 | 31.3 | 3003 | 79476 |
| deu-spa | newstest2012 | 0.58119 | 32.6 | 3003 | 79006 |
| deu-spa | newstest2013 | 0.57559 | 32.4 | 3000 | 70528 |
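
The test sets in the table differ considerably in sentence length, which is worth keeping in mind when comparing scores across rows. The sketch below copies the figures from the table and derives the average number of words per sentence for each test set; the variable and function names are illustrative only.

```python
# (test set, chr-F, BLEU, sentences, words), copied from the benchmark table above
RESULTS = [
    ("tatoeba-test-v2021-08-07", 0.69105, 50.8, 10521, 82570),
    ("flores101-devtest", 0.53208, 24.9, 1012, 29199),
    ("newssyscomb2009", 0.55547, 28.3, 502, 12503),
    ("news-test2008", 0.54400, 26.6, 2051, 52586),
    ("newstest2009", 0.53934, 25.9, 2525, 68111),
    ("newstest2010", 0.60102, 33.8, 2489, 65480),
    ("newstest2011", 0.57133, 31.3, 3003, 79476),
    ("newstest2012", 0.58119, 32.6, 3003, 79006),
    ("newstest2013", 0.57559, 32.4, 3000, 70528),
]


def words_per_sentence(row):
    """Average sentence length of a test set, in words."""
    _, _, _, n_sentences, n_words = row
    return n_words / n_sentences


# Test sets with the highest and lowest BLEU score
best = max(RESULTS, key=lambda row: row[2])
worst = min(RESULTS, key=lambda row: row[2])

for row in RESULTS:
    name, chrf, bleu, _, _ = row
    print(f"{name:26s} chr-F={chrf:.3f}  BLEU={bleu:4.1f}  "
          f"words/sentence={words_per_sentence(row):4.1f}")

print("highest BLEU:", best[0])
print("lowest BLEU: ", worst[0])
```

Tatoeba sentences average about 8 words, while the news and flores101 sets average roughly 24 to 29 words per sentence. This difference may partly account for the much higher Tatoeba score, so BLEU values are best compared within a test set rather than across test sets.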

Citation information

If you use this model, please cite the following publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/).

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

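Acknowledgements

This work is supported by the [European Language Grid](https://www.european-language-grid.eu/) as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866); by the [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113); and by the MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC (IT Center for Science, Finland).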