🚀 opus-mt-tc-bible-big-mul-mul
This is a multilingual translation model that supports translation from many source languages into many target languages. It was trained on large amounts of publicly available data and can be used for translation and text-to-text generation tasks, but because of limitations in the training data, translation quality may be poor for some language pairs.
🚀 Quick Start
Here is a simple example of using the model for translation:
```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>rus<< You'd better not speak to Tom about that.",
    ">>ceb<< How are you?"
]

model_name = "pytorch-models/opus-mt-tc-bible-big-mul-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Лучше бы не поговорить с Томом об этом.
#     Sa unsang paagi ikaw?
```
You can also use OPUS-MT models with the `transformers` pipeline, for example:
```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")
print(pipe(">>rus<< You'd better not speak to Tom about that."))

# expected output: Лучше бы не поговорить с Томом об этом.
```
✨ Key Features
- Multilingual support: translates between a large number of languages.
- Broad applicability: usable for translation and text-to-text generation tasks.
📦 Installation
The original documentation does not list installation steps; to use this model, follow the official documentation of the `transformers` library.
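As a minimal sketch, assuming a standard pip-based Python environment, installing `transformers` together with `sentencepiece` (needed by the Marian tokenizer) and PyTorch is typically sufficient:

```bash
pip install transformers sentencepiece torch
```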
📚 Detailed Documentation
Model Details
This is a neural machine translation model for translating from multiple languages (mul) to multiple languages (mul). Note that, because training data is very limited for most of the languages, support for many of the listed languages may be poor. Translation performance varies considerably, and for a large number of language pairs the model may not work at all.
The model is part of the OPUS-MT project, an effort to make neural machine translation models widely available for many of the world's languages. All models were originally trained with Marian NMT, an excellent and efficient NMT implementation written in pure C++. The models were converted to PyTorch using the `transformers` library by Hugging Face. Training data comes from OPUS, and the training pipeline follows the procedures of OPUS-MT-train.
Model description:
Attribute | Details |
---|---|
Developed by | Language Technology Research Group at the University of Helsinki |
Model type | Translation (transformer-big) |
Release date | 2024-08-17 |
License | Apache-2.0 |
Source languages | aar abk ace ach acm ady afb afh afr aii ajp aka akl aln alt amh ami amu ang anp aoz apc ara arc arg arq arz asm ast atj ava avk awa ayl aze azz bak bal bam ban bar bas bcl bel bem ben bho bik bis bod bom bos bpy bre brx bua bug bul bvy byn bzt cak cat cay cbk ceb ces cha che chg chm chq chr chu chv chy cjk cjp cjy ckb cmn cnh cni cnr cop cor cos cre crh crk crs csb cym dag dan deu dik din diq div dje djk dng dop drt dsb dtp dty dws dyu dzo efi egl ell emx eng enm epo est eus evn ewe ext fao fas fij fil fin fkv fon fra frm fro frp frr fry fuc ful fur gag gbm gcf gil gla gle glg glk glv gor gos got grc grn gsw guc guj guw hat hau haw hbo hbs heb her hif hil hin hmn hne hnj hoc hrv hrx hsb hsn hun hus hye hyw iba ibo ido igs iii ike iku ile ilo ina ind inh ipk isl ita ixl izh jaa jak jam jav jbo jdt jpa jpn kaa kab kac kal kam kan kas kat kau kaz kbd kbp kea kek kha khm kik kin kir kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur kxi laa lad lah lao lat lav lbe ldn lez lfn lij lim lin lit liv lkt lld lmo lou lrc ltz lua lug luo lus lut luy lzz mad mag mah mai mal mam mar max mdf meh mfa mfe mgm mic mix mkd mlg mlt mnc mni mnr mnw moh mol mon mos mri mrj msa mvv mwl mww mya myv mzn nap nau nav nbl nch nde nds nep new ngt ngu nhg nhn nia niu nld nlv nnb nno nob nog non nov npi nqo nso nst nus nya oar oci ofs oji ood ori orm orv osp oss ota otk pag pai pal pam pan pap pau pcd pck pcm pdc pes pfl phn pih pli plt pms pmy pnt pol por pot ppk ppl prg prs pus quc qxq qya rap rhg rif rmy roh rom ron rue run rup rus sag sah san sat scn sco sdh ses sgs shi shn shs shy sin sjn skr slk slv sma sme sml smn smo sna snd som sot spa sqi srd srn srp ssw stq sun swa swc swe swg swh syc syl syr szl tah tam taq tat tcy tel tet tgk tgl tha thv tig tir tkl tlh tly tmh tmr tmw toi ton tpi tpw trs trv tsn tso tts tuk tum tur tvl twi tyj tyv tzl tzm udm uig ukr umb urd usp uzb vec ven vep vie vls vol vot vro wae wal war wln wol wuu xal xcl xho xmf yid yor yua yue zam zap zea zgh zha zlm zsm zul zza |
Target languages | aar abk ace ach acm ady afb afh_Latn afr aii_Syrc ajp aka akl_Latn aln alt amh ami ami_Latn amu_Latn ang_Latn anp aoz apc ara arc arg arq arz asm ast atj ava avk_Latn awa ayl aze_Cyrl aze_Latn azz azz_Latn bak bal bal_Latn bam_Latn ban bar bas bcl bel bem ben bho bik bis bod bom_Latn bos_Cyrl bos_Latn bpy bre brx bua bug bul bvy_Latn byn bzt_Latn cak cak_Latn cat cay cbk_Latn ceb ces cha che chg_Arab chg_Latn chm chq_Latn chr chu chv chy cjk cjk_Latn cjp_Latn cjy_Hans cjy_Hant ckb cmn cmn_Hans cmn_Hant cnh cnh_Latn cni_Latn cnr cnr_Latn cop cop_Copt cor cos cre cre_Latn crh crk crs csb csb_Latn cym dag_Latn dan deu dik din diq div dje djk djk_Latn dng dop_Latn drt_Latn dsb dtp dty dws_Latn dyu dzo efi egl ell emx_Latn eng enm_Latn epo est eus evn ewe ext fao fas fij fil fin fkv_Latn fon fra frm_Latn fro_Latn frp frr fry fuc ful fur gag gbm gcf gcf_Latn gil gla gle glg glk glv gor gos got got_Goth grc grn gsw guc guj guw guw_Latn hat hau_Latn haw hbo_Hebr hbs hbs_Cyrl hbs_Latn heb her hif_Latn hil hin hmn hne hnj hoc hrv hrx hsb hsn hun hus hus_Latn hye hyw iba ibo ido_Latn igs_Latn iii ike_Latn iku_Latn ile ile_Latn ilo ina_Latn ind inh inh_Latn ipk isl ita ixl_Latn izh jaa jaa_Bopo jaa_Hira jaa_Kana jaa_Yiii jak_Latn jam jav jav_Java jbo jbo_Cyrl jbo_Latn jdt_Cyrl jpa_Hebr jpn kaa kab kac kal kam kan kas_Arab kas_Deva kat kau kaz kaz_Cyrl kbd kbp kbp_Cans kbp_Ethi kbp_Geor kbp_Grek kbp_Hang kbp_Latn kbp_Mlym kbp_Yiii kea kek kek_Latn kha khm kik kin kir_Cyrl kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur_Arab kur_Cyrl kur_Latn kxi_Latn laa_Latn lad lad_Latn lah lao lat lat_Latn lav lbe ldn_Latn lez lfn_Cyrl lfn_Latn lij lim lin lit liv_Latn lkt lld_Latn lmo lou_Latn lrc ltz lua lug luo lus lut_Latn luy lzz_Geor lzz_Latn mad mag mah mai mal mam mam_Latn mar max_Latn mdf meh_Latn mfa mfe mgm_Latn mic mix mix_Latn mkd mlg mlt mnc_Mong mni mnr_Latn mnw moh mol mon mos mri mrj msa_Arab msa_Latn mvv_Latn mwl mww mya myv mzn nap nau nav nbl nch nde nds nep new ngt_Latn ngu ngu_Latn nhg_Latn nhn_Latn nia niu nld nlv_Latn nnb_Latn nno nob nog non nov_Latn npi nqo nso nst_Latn nus nya oar_Hebr oar_Syrc oci ofs_Latn oji_Latn ood_Latn ori orm orv_Cyrl osp_Latn oss ota_Arab ota_Latn ota_Rohg ota_Syrc ota_Thaa ota_Yezi otk otk_Orkh pag pai_Latn pal pam pan pan_Guru pap pau pcd pck_Latn pcm pdc pes pfl phn_Phnx pih pih_Latn pli plt pms pmy_Latn pnt_Grek pol por pot_Latn ppk_Latn ppl_Latn prg_Latn prs pus quc qxq_Arab qxq_Latn qya qya_Latn rap rhg_Latn rif_Latn rmy roh rom rom_Cyrl ron rue run rup rus sag sah san san_Deva sat sat_Latn scn sco sdh ses sgs shi_Latn shn shs_Latn shy_Latn sin sjn_Latn skr slk slv sma sme sml_Latn smn smo sna snd_Arab som sot spa sqi srd srn srp_Cyrl ssw stq sun swa swc swe swg swh syc_Syrc syl_Sylo syr szl tah tam taq tat tcy tel tet tgk_Cyrl tgk_Latn tgl tgl_Latn tgl_Tglg tha thv tig tir tkl tlh tlh_Latn tly_Latn tmh tmr_Hebr tmw_Latn toi toi_Latn ton tpi tpw_Latn trs trs_Latn trv tsn tso tts tuk tuk_Cyrl tuk_Latn tum tur tvl twi tyj_Latn tyv tzl tzl_Latn tzm_Latn tzm_Tfng udm uig uig_Arab uig_Cyrl uig_Latn ukr umb urd usp_Latn uzb_Cyrl uzb_Latn vec ven vep vie vls vol_Latn vot vot_Latn vro wae wal war wln wol wuu xal xcl_Armn xcl_Latn xho xmf yid yor yua yue_Hans yue_Hant zam zap zea zgh zha zlm_Arab zlm_Latn zsm_Arab zsm_Latn zul zza |
Original model | opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip |
Resources for more information | OPUS-MT dashboard; OPUS-MT-train GitHub repository; more information about MarianNMT models in the transformers library; the Tatoeba Translation Challenge; HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset); a massively parallel Bible corpus |
This is a multilingual translation model with multiple target languages. A sentence-initial language token in the form `>>id<<` (where id is a valid target language ID) is required, e.g. `>>aar<<`.
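For illustration, here is a minimal sketch of prepending the target-language token (the `translate_to` helper is hypothetical; `deu` and `fin` are valid target IDs taken from the list above):

```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")

# Hypothetical helper: prepend the >>id<< token so the model knows the target language.
def translate_to(text, tgt_lang):
    return pipe(f">>{tgt_lang}<< {text}")[0]["translation_text"]

print(translate_to("How are you?", "deu"))  # German
print(translate_to("How are you?", "fin"))  # Finnish
```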
Uses
The model can be used for translation and text-to-text generation.
Risks, Limitations and Biases
⚠️ Important Note
Readers should be aware that the model was trained on various public datasets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).
Also note that, because training data is very limited for most of the languages, support for many of the listed languages may be poor. Translation performance varies considerably, and for a large number of language pairs the model may not work at all.
🔧 Technical Details
Training
- Data: opusTCv20230926+bt+jhubc (source)
- Pre-processing: SentencePiece (spm64k, spm64k); see the sketch after this list
- Model type: transformer-big
- Original MarianNMT model: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip
- Training scripts: GitHub repository
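As a quick sanity check (a sketch, assuming the model is downloaded as in the Quick Start), the tokenizer's vocabulary size reflects the shared spm64k SentencePiece models:

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")
print(tokenizer.vocab_size)  # expected to be on the order of 64k for spm64k vocabularies
```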
Evaluation
- Model scores on the OPUS-MT dashboard
- Test set translations: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.test.txt
- Test set scores: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.eval.txt
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip

Language pair | Test set | chr-F | BLEU | #Sentences | #Words |
---|---|---|---|---|---|
multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.51760 | 28.1 | 10000 | 73531 |
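For orientation, scores like these are typically computed with the sacrebleu library; here is a minimal sketch with made-up example data (the actual evaluation setup is documented in the files linked above):

```python
import sacrebleu

# Made-up example data; the real evaluation uses the full Tatoeba test set above.
hypotheses = ["Лучше не говорить с Томом об этом."]
references = [["Лучше не говори об этом с Томом."]]  # one reference stream

print(sacrebleu.corpus_bleu(hypotheses, references).score)  # BLEU
print(sacrebleu.corpus_chrf(hypotheses, references).score)  # chrF
```

Note that sacrebleu reports chrF on a 0-100 scale, while the table above uses a 0-1 scale.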
📄 License
This model is released under the Apache-2.0 license.
Acknowledgements
This work was supported by the HPLT project, funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the European high-performance computer LUMI.
Model Conversion Info
- transformers version: 4.45.1
- OPUS-MT git hash: 0882077
- Conversion time: Wed Oct 9 19:20:34 EEST 2024
- Conversion machine: LM0-400-22516.local
Citation Information
If you use this model, please cite the following publications:
- Democratizing neural machine translation with OPUS-MT
- OPUS-MT – Building open translation services for the World
- The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT
```bibtex
@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} {--} Building open translation services for the World",
  author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  month = nov,
  year = "2020",
  address = "Lisboa, Portugal",
  publisher = "European Association for Machine Translation",
  url = "https://aclanthology.org/2020.eamt-1.61",
  pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
  title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
  author = {Tiedemann, J{\"o}rg},
  booktitle = "Proceedings of the Fifth Conference on Machine Translation",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.wmt-1.139",
  pages = "1174--1182",
}
```



