opus-mt-tc-big-en-fr: an open-source model for English-to-French translation

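Developed by Helsinki-NLP

A neural machine translation model based on the Transformer architecture that translates English into French. It is part of the OPUS-MT project, which aims to make translation models available for a wide range of languages.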

Machine translation

Transformers

Tags: #English-French neural machine translation #big model architecture #multi-domain adaptation

Downloads: 27.11k

Published: 4/13/2022

Model overview

The model was trained with the Marian NMT framework and translates English into French across a range of text types and application scenarios.

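Model features

Efficient translation

Built on the Transformer-big architecture for English-to-French translation.

Broad coverage

The training data comes from the OPUS project and spans many text types and domains.

Ease of use

Can be loaded and integrated through the Hugging Face transformers library.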

Model capabilities

Text translation

Multi-domain translation

High-quality translation

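Use cases

Education

Language learning

Helps students and language learners translate English text into French quickly.

Can speed up study and support language comprehension.

Business

Document translation

Translates business documents, contracts, and reports from English into French.

Can reduce manual translation cost and turnaround time.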

🚀 opus-mt-tc-big-en-fr

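This is a neural machine translation model for translating from English (en) to French (fr). It is part of the OPUS-MT project, an effort to make neural machine translation models widely available for many of the world's languages.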

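🚀 Quick start

Model description

This model is an English-to-French neural machine translation model from the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT). All models were originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the Hugging Face transformers library. The training data is taken from OPUS, and the training pipeline follows the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).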

Citation

Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/). Please cite them if you use this model.

```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}
```

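✨ Key features

- Widely available: part of the OPUS-MT project, which makes neural machine translation models available for many languages.
- Efficient training: trained with the Marian NMT framework and converted to PyTorch with the transformers library.
- Rich data: the training data comes from OPUS, and the training pipeline follows OPUS-MT-train.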

💻 Usage examples

Basic usage

```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "The Portuguese teacher is very demanding.",
    "When was your last hearing test?"
]

# Hugging Face Hub ID of the converted model
model_name = "Helsinki-NLP/opus-mt-tc-big-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Le professeur de portugais est très exigeant.
#     Quand a eu lieu votre dernier test auditif ?
```
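
For longer lists of sentences, the basic example above can be extended to translate in fixed-size batches. The following is a minimal sketch, not part of the original model card: the helper names `chunks` and `translate`, the batch size, and the generation settings (`num_beams=4`, `max_new_tokens=256`) are illustrative assumptions, and the Hub ID `Helsinki-NLP/opus-mt-tc-big-en-fr` is taken from the pipeline example in this document.

```python
from typing import Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate(texts: List[str],
              model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-fr",
              batch_size: int = 8,        # assumed value, tune for your hardware
              num_beams: int = 4,         # assumed value
              max_new_tokens: int = 256   # assumed value
              ) -> List[str]:
    """Translate English sentences to French, one batch at a time."""
    # Imported lazily so that `chunks` can be used without these packages.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    results: List[str] = []
    for batch in chunks(texts, batch_size):
        # Pad to the longest sentence in the batch and truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        with torch.no_grad():
            generated = model.generate(**inputs, num_beams=num_beams,
                                       max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling `translate(["When was your last hearing test?"])` downloads the model on first use and returns a list with one French string. Smaller batches lower peak memory at the cost of throughput.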

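Advanced usage

```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-fr")
# The pipeline returns a list of dicts; the text is stored under "translation_text"
print(pipe("The Portuguese teacher is very demanding.")[0]["translation_text"])

# expected output: Le professeur de portugais est très exigeant.
```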
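
The translation pipeline also accepts a list of sentences and returns one dict per input. The sketch below shows one way to turn that output into plain strings. The helper names `extract_translations` and `translate_with_pipeline` and the `batch_size` default are assumptions made for illustration; they are not part of the original model card.

```python
from typing import Dict, List


def extract_translations(outputs: List[Dict[str, str]]) -> List[str]:
    """Pull the translated strings out of translation-pipeline output."""
    return [item["translation_text"] for item in outputs]


def translate_with_pipeline(texts: List[str], batch_size: int = 8) -> List[str]:
    """Translate a list of English sentences with the transformers pipeline."""
    # Imported lazily so that `extract_translations` works without transformers.
    from transformers import pipeline

    pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-fr")
    return extract_translations(pipe(texts, batch_size=batch_size))
```

Keeping the extraction step separate makes it easy to check the output handling without loading the model.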

📚 Detailed documentation

Model information

| Property | Details |
|---|---|
| Model type | Neural machine translation, English to French |
| Training data | opusTCv20210807+bt ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge)) |
| Release date | 2022-03-09 |
| Source language | English (eng) |
| Target language | French (fra) |
| Model architecture | transformer-big |
| Tokenization | SentencePiece (spm32k,spm32k) |
| Original model | [opusTCv20210807+bt_transformer-big_2022-03-09.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fra/opusTCv20210807+bt_transformer-big_2022-03-09.zip) |
| More information | [OPUS-MT eng-fra README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-fra/README.md) |

Benchmarks

| Language pair | Test set | chr-F | BLEU | Sentences | Words |
|---|---|---|---|---|---|
| eng-fra | tatoeba-test-v2021-08-07 | 0.69621 | 53.2 | 12681 | 106378 |
| eng-fra | flores101-devtest | 0.72494 | 52.2 | 1012 | 28343 |
| eng-fra | multi30k_test_2016_flickr | 0.72361 | 52.4 | 1000 | 13505 |
| eng-fra | multi30k_test_2017_flickr | 0.72826 | 52.8 | 1000 | 12118 |
| eng-fra | multi30k_test_2017_mscoco | 0.73547 | 54.7 | 461 | 5484 |
| eng-fra | multi30k_test_2018_flickr | 0.66723 | 43.7 | 1071 | 15867 |
| eng-fra | newsdiscussdev2015 | 0.60471 | 33.4 | 1500 | 27940 |
| eng-fra | newsdiscusstest2015 | 0.64915 | 40.3 | 1500 | 27975 |
| eng-fra | newssyscomb2009 | 0.58903 | 30.7 | 502 | 12331 |
| eng-fra | news-test2008 | 0.55516 | 27.6 | 2051 | 52685 |
| eng-fra | newstest2009 | 0.57907 | 30.0 | 2525 | 69263 |
| eng-fra | newstest2010 | 0.60156 | 33.5 | 2489 | 66022 |
| eng-fra | newstest2011 | 0.61632 | 35.0 | 3003 | 80626 |
| eng-fra | newstest2012 | 0.59736 | 32.8 | 3003 | 78011 |
| eng-fra | newstest2013 | 0.59700 | 34.6 | 3000 | 70037 |
| eng-fra | newstest2014 | 0.66686 | 41.9 | 3003 | 77306 |
| eng-fra | tico19-test | 0.63022 | 40.6 | 2100 | 64661 |
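
The chr-F column is a character n-gram F-score on a 0 to 1 scale. To give a feel for what it measures, here is a simplified sentence-level sketch. It assumes n-gram orders 1 to 6, beta = 2, and whitespace removed before counting. It is not the evaluation tool used to produce the table, and it will not reproduce the corpus-level scores above.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order `n`, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over precision and recall averaged across orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf("Le professeur de portugais est exigeant.",
               "Le professeur de portugais est très exigeant."))
```

Because it works on characters rather than words, a hypothesis that differs from the reference only in a word ending or an accent still receives partial credit, which BLEU would not give.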

Benchmark files

- Test set translations: [opusTCv20210807+bt_transformer-big_2022-03-09.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fra/opusTCv20210807+bt_transformer-big_2022-03-09.test.txt)
- Test set scores: [opusTCv20210807+bt_transformer-big_2022-03-09.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fra/opusTCv20210807+bt_transformer-big_2022-03-09.eval.txt)
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip

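Acknowledgements

This work is supported by:

- [Pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866) of the [European Language Grid](https://www.european-language-grid.eu/).
- The [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).
- The MeMAD project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069).

We are also grateful for the computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.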