🚀 opus-mt-tc-bible-big-mul-mul
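This is a multilingual translation model that translates from many source languages into many target languages. It was trained on a large amount of public data and can be used for translation and text generation tasks. Because training data is limited for most languages, translation quality for some language pairs may be poor.
🚀 Quick Start
The following short example shows how to translate with this model: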
from transformers import MarianMTModel, MarianTokenizer

# Each sentence starts with the >>id<< token of its target language.
src_text = [
    ">>rus<< You'd better not speak to Tom about that.",
    ">>ceb<< How are you?"
]

# Same repository ID as in the pipeline example.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-mul-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Лучше бы не поговорить с Томом об этом.
#     Sa unsang paagi ikaw?
You can also use OPUS-MT models through the transformers pipeline API, for example:
from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")
print(pipe(">>rus<< You'd better not speak to Tom about that."))
# expected output: Лучше бы не поговорить с Томом об этом.
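✨ Key Features
- Multilingual: translates between a large number of languages.
- Broad applicability: can be used for translation and text-to-text generation tasks.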
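📦 Installation
The original documentation does not describe installation steps. To use the model, install the transformers library by following its official documentation.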
💻 Usage Examples
Basic usage
from transformers import MarianMTModel, MarianTokenizer

# Each sentence starts with the >>id<< token of its target language.
src_text = [
    ">>rus<< You'd better not speak to Tom about that.",
    ">>ceb<< How are you?"
]

# Same repository ID as in the pipeline example.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-mul-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Лучше бы не поговорить с Томом об этом.
#     Sa unsang paagi ikaw?
Advanced usage
from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-mul-mul")
print(pipe(">>rus<< You'd better not speak to Tom about that."))
# expected output: Лучше бы не поговорить с Томом об этом.
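📚 Documentation
Model details
This is a neural machine translation model for translating from multiple languages (mul) to multiple languages (mul). Note that many of the listed languages may not be well supported, because training data is very limited for most of them. Translation performance varies widely, and for a large number of language pairs the model may not work at all.
The model is part of the OPUS-MT project, an effort to make neural machine translation models widely available for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and then converted to PyTorch using the transformers library by huggingface. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.
Model description: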
Property | Details |
---|---|
Developed by | Language Technology Research Group at the University of Helsinki |
Model type | Translation (transformer-big) |
Release date | 2024-08-17 |
License | Apache-2.0 |
源语言 | aar abk ace ach acm ady afb afh afr aii ajp aka akl aln alt amh ami amu ang anp aoz apc ara arc arg arq arz asm ast atj ava avk awa ayl aze azz bak bal bam ban bar bas bcl bel bem ben bho bik bis bod bom bos bpy bre brx bua bug bul bvy byn bzt cak cat cay cbk ceb ces cha che chg chm chq chr chu chv chy cjk cjp cjy ckb cmn cnh cni cnr cop cor cos cre crh crk crs csb cym dag dan deu dik din diq div dje djk dng dop drt dsb dtp dty dws dyu dzo efi egl ell emx eng enm epo est eus evn ewe ext fao fas fij fil fin fkv fon fra frm fro frp frr fry fuc ful fur gag gbm gcf gil gla gle glg glk glv gor gos got grc grn gsw guc guj guw hat hau haw hbo hbs heb her hif hil hin hmn hne hnj hoc hrv hrx hsb hsn hun hus hye hyw iba ibo ido igs iii ike iku ile ilo ina ind inh ipk isl ita ixl izh jaa jak jam jav jbo jdt jpa jpn kaa kab kac kal kam kan kas kat kau kaz kbd kbp kea kek kha khm kik kin kir kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur kxi laa lad lah lao lat lav lbe ldn lez lfn lij lim lin lit liv lkt lld lmo lou lrc ltz lua lug luo lus lut luy lzz mad mag mah mai mal mam mar max mdf meh mfa mfe mgm mic mix mkd mlg mlt mnc mni mnr mnw moh mol mon mos mri mrj msa mvv mwl mww mya myv mzn nap nau nav nbl nch nde nds nep new ngt ngu nhg nhn nia niu nld nlv nnb nno nob nog non nov npi nqo nso nst nus nya oar oci ofs oji ood ori orm orv osp oss ota otk pag pai pal pam pan pap pau pcd pck pcm pdc pes pfl phn pih pli plt pms pmy pnt pol por pot ppk ppl prg prs pus quc qxq qya rap rhg rif rmy roh rom ron rue run rup rus sag sah san sat scn sco sdh ses sgs shi shn shs shy sin sjn skr slk slv sma sme sml smn smo sna snd som sot spa sqi srd srn srp ssw stq sun swa swc swe swg swh syc syl syr szl tah tam taq tat tcy tel tet tgk tgl tha thv tig tir tkl tlh tly tmh tmr tmw toi ton tpi tpw trs trv tsn tso tts tuk tum tur tvl twi tyj tyv tzl tzm udm uig ukr umb urd usp uzb vec ven vep vie vls vol vot vro wae wal war wln wol wuu xal xcl xho xmf yid yor yua yue zam zap 
zea zgh zha zlm zsm zul zza |
目标语言 | aar abk ace ach acm ady afb afh_Latn afr aii_Syrc ajp aka akl_Latn aln alt amh ami ami_Latn amu_Latn ang_Latn anp aoz apc ara arc arg arq arz asm ast atj ava avk_Latn awa ayl aze_Cyrl aze_Latn azz azz_Latn bak bal bal_Latn bam_Latn ban bar bas bcl bel bem ben bho bik bis bod bom_Latn bos_Cyrl bos_Latn bpy bre brx bua bug bul bvy_Latn byn bzt_Latn cak cak_Latn cat cay cbk_Latn ceb ces cha che chg_Arab chg_Latn chm chq_Latn chr chu chv chy cjk cjk_Latn cjp_Latn cjy_Hans cjy_Hant ckb cmn cmn_Hans cmn_Hant cnh cnh_Latn cni_Latn cnr cnr_Latn cop cop_Copt cor cos cre cre_Latn crh crk crs csb csb_Latn cym dag_Latn dan deu dik din diq div dje djk djk_Latn dng dop_Latn drt_Latn dsb dtp dty dws_Latn dyu dzo efi egl ell emx_Latn eng enm_Latn epo est eus evn ewe ext fao fas fij fil fin fkv_Latn fon fra frm_Latn fro_Latn frp frr fry fuc ful fur gag gbm gcf gcf_Latn gil gla gle glg glk glv gor gos got got_Goth grc grn gsw guc guj guw guw_Latn hat hau_Latn haw hbo_Hebr hbs hbs_Cyrl hbs_Latn heb her hif_Latn hil hin hmn hne hnj hoc hrv hrx hsb hsn hun hus hus_Latn hye hyw iba ibo ido_Latn igs_Latn iii ike_Latn iku_Latn ile ile_Latn ilo ina_Latn ind inh inh_Latn ipk isl ita ixl_Latn izh jaa jaa_Bopo jaa_Hira jaa_Kana jaa_Yiii jak_Latn jam jav jav_Java jbo jbo_Cyrl jbo_Latn jdt_Cyrl jpa_Hebr jpn kaa kab kac kal kam kan kas_Arab kas_Deva kat kau kaz kaz_Cyrl kbd kbp kbp_Cans kbp_Ethi kbp_Geor kbp_Grek kbp_Hang kbp_Latn kbp_Mlym kbp_Yiii kea kek kek_Latn kha khm kik kin kir_Cyrl kiu kjh kmb kmr knc koi kok kom kon kpv krc krl ksh kua kum kur_Arab kur_Cyrl kur_Latn kxi_Latn laa_Latn lad lad_Latn lah lao lat lat_Latn lav lbe ldn_Latn lez lfn_Cyrl lfn_Latn lij lim lin lit liv_Latn lkt lld_Latn lmo lou_Latn lrc ltz lua lug luo lus lut_Latn luy lzz_Geor lzz_Latn mad mag mah mai mal mam mam_Latn mar max_Latn mdf meh_Latn mfa mfe mgm_Latn mic mix mix_Latn mkd mlg mlt mnc_Mong mni mnr_Latn mnw moh mol mon mos mri mrj msa_Arab msa_Latn mvv_Latn mwl mww mya myv mzn nap nau nav nbl nch 
nde nds nep new ngt_Latn ngu ngu_Latn nhg_Latn nhn_Latn nia niu nld nlv_Latn nnb_Latn nno nob nog non nov_Latn npi nqo nso nst_Latn nus nya oar_Hebr oar_Syrc oci ofs_Latn oji_Latn ood_Latn ori orm orv_Cyrl osp_Latn oss ota_Arab ota_Latn ota_Rohg ota_Syrc ota_Thaa ota_Yezi otk otk_Orkh pag pai_Latn pal pam pan pan_Guru pap pau pcd pck_Latn pcm pdc pes pfl phn_Phnx pih pih_Latn pli plt pms pmy_Latn pnt_Grek pol por pot_Latn ppk_Latn ppl_Latn prg_Latn prs pus quc qxq_Arab qxq_Latn qya qya_Latn rap rhg_Latn rif_Latn rmy roh rom rom_Cyrl ron rue run rup rus sag sah san san_Deva sat sat_Latn scn sco sdh ses sgs shi_Latn shn shs_Latn shy_Latn sin sjn_Latn skr slk slv sma sme sml_Latn smn smo sna snd_Arab som sot spa sqi srd srn srp_Cyrl ssw stq sun swa swc swe swg swh syc_Syrc syl_Sylo syr szl tah tam taq tat tcy tel tet tgk_Cyrl tgk_Latn tgl tgl_Latn tgl_Tglg tha thv tig tir tkl tlh tlh_Latn tly_Latn tmh tmr_Hebr tmw_Latn toi toi_Latn ton tpi tpw_Latn trs trs_Latn trv tsn tso tts tuk tuk_Cyrl tuk_Latn tum tur tvl twi tyj_Latn tyv tzl tzl_Latn tzm_Latn tzm_Tfng udm uig uig_Arab uig_Cyrl uig_Latn ukr umb urd usp_Latn uzb_Cyrl uzb_Latn vec ven vep vie vls vol_Latn vot vot_Latn vro wae wal war wln wol wuu xal xcl_Armn xcl_Latn xho xmf yid yor yua yue_Hans yue_Hant zam zap zea zgh zha zlm_Arab zlm_Latn zsm_Arab zsm_Latn zul zza |
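Original model | opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip |
Resources for more information | OPUS-MT dashboard; OPUS-MT-train GitHub repository; more information about MarianNMT models in the transformers library; Tatoeba Translation Challenge; HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset); a massively parallel Bible corpus |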
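This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<< (where id is a valid target language ID), for example >>aar<<.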
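The target-language token convention can be wrapped in a small helper so that the prefix is never forgotten or mistyped. The following is a minimal sketch in plain Python; the function names with_target_token and fan_out are hypothetical and not part of transformers or OPUS-MT, and the language IDs used are taken from the target language list in the table.

```python
from typing import Iterable, List


def with_target_token(text: str, target_lang: str) -> str:
    """Prefix `text` with the >>id<< token that selects the target language."""
    lang = target_lang.strip()
    if not lang:
        raise ValueError("target_lang must be a non-empty language ID")
    return f">>{lang}<< {text.strip()}"


def fan_out(text: str, target_langs: Iterable[str]) -> List[str]:
    """Build one model input per target language for the same source sentence."""
    return [with_target_token(text, lang) for lang in target_langs]


# One source sentence, three target languages (IDs from the target list).
inputs = fan_out("How are you?", ["rus", "ceb", "cmn_Hans"])
for line in inputs:
    print(line)
```

The resulting list of strings can be passed to the tokenizer or to the translation pipeline in the same way as src_text in the usage examples.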
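Uses
This model can be used for translation and text-to-text generation.
Risks, limitations and biases
⚠️ Important
Readers should be aware that the model is trained on various public datasets that may contain disturbing or offensive content and that can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues in language models (see, for example, Sheng et al. (2021) and Bender et al. (2021)).
Also note that many of the listed languages may not be well supported, because training data is very limited for most of them. Translation performance varies widely, and for a large number of language pairs the model may not work at all.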
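Given this uneven coverage, one cheap pre-check is to confirm that the tokenizer knows the language token for a target ID before translating. The sketch below assumes that language tokens appear in the vocabulary in the literal form >>id<<. The helper name supported_target_ids and the toy vocabulary are hypothetical; in practice the mapping would presumably come from tokenizer.get_vocab(). Finding a token only shows that the ID is known to the tokenizer and says nothing about translation quality for that language.

```python
import re
from typing import Iterable, Set

# Matches vocabulary entries of the form >>id<< and captures the ID.
_LANG_TOKEN = re.compile(r"^>>(.+)<<$")


def supported_target_ids(vocab: Iterable[str]) -> Set[str]:
    """Collect the language IDs of all >>id<< tokens found in `vocab`."""
    ids = set()
    for token in vocab:
        match = _LANG_TOKEN.match(token)
        if match:
            ids.add(match.group(1))
    return ids


# Toy vocabulary standing in for tokenizer.get_vocab() (assumption).
toy_vocab = {">>rus<<": 0, ">>ceb<<": 1, "▁Hello": 2, "</s>": 3}
known = supported_target_ids(toy_vocab)

print(sorted(known))      # IDs that have a language token
print("deu" in known)     # False here: no >>deu<< token in the toy vocabulary
```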
🔧 Technical Details
Training
- Data: opusTCv20230926+bt+jhubc (source)
- Pre-processing: SentencePiece (spm64k,spm64k)
- Model type: transformer-big
- Original MarianNMT model: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.zip
- Training scripts: GitHub repository
Evaluation
- Model scores on the OPUS-MT dashboard
- Test set translations: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.test.txt
- Test set scores: opusTCv20230926+bt+jhubc_transformer-big_2024-08-17.eval.txt
- Benchmark results: benchmark_results.txt
- Benchmark output: benchmark_translations.zip
Language pair | Test set | chr-F | BLEU | Sentences | Words |
---|---|---|---|---|---|
multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.51760 | 28.1 | 10000 | 73531 |
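To give a feel for what a character n-gram F-score such as chr-F measures, here is a simplified sentence-level sketch on a 0 to 1 scale. It is not the implementation used to produce the scores above and will not reproduce them. The function names, the whitespace handling, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order `n`, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score: F-beta over averaged n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("How are you?", "How are you?"))  # identical strings
print(simple_chrf("How are you?", "How are you"))   # near match
print(simple_chrf("abc", "xyz"))                    # no shared characters
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that drops reference characters is penalized more than one that adds extra characters.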
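📄 License
This model is released under the Apache-2.0 license.
Acknowledgements
This work is supported by the HPLT project, funded by the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.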
Model conversion info
- transformers version: 4.45.1
- OPUS-MT git hash: 0882077
- Conversion time: Wed Oct 9 19:20:34 EEST 2024
- Conversion machine: LM0-400-22516.local
Citation information
If you use this model, please cite the following publications:
- Democratizing neural machine translation with OPUS-MT
- OPUS-MT – Building open translation services for the World
- The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
journal={Language Resources and Evaluation},
number={58},
pages={713--755},
year={2023},
publisher={Springer Nature},
issn={1574-0218},
doi={10.1007/s10579-023-09704-w}
}
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} {--} Building open translation services for the World",
author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
month = nov,
year = "2020",
address = "Lisboa, Portugal",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2020.eamt-1.61",
pages = "479--480",
}
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}



