SONAR 200 Text Encoder
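🚀 Multilingual SONAR Text Encoder
This project is a port of the multilingual SONAR text encoder (https://huggingface.co/facebook/SONAR) from the fairseq2 format to the transformers format. The encoder converts text into vector representations and supports 202 languages, which makes it suitable for multilingual text processing.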
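🚀 Quick Start
The converted encoder is expected to produce the same embeddings as the official implementation (https://github.com/facebookresearch/SONAR), but the official implementation remains the source of truth.
The encoder supports the same 202 languages as NLLB-200 (see also the source model card and the FLORES-200 language code mapping).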
💻 Usage Examples
Basic usage
# !pip install transformers sentencepiece -q

import torch
from transformers import AutoTokenizer
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

model_name = "cointegrated/SONAR_200_text_encoder"
encoder = M2M100Encoder.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def encode_mean_pool(texts, tokenizer, encoder, lang='eng_Latn', norm=False):
    # The tokenizer needs the source language code (language + script).
    tokenizer.src_lang = lang
    with torch.inference_mode():
        batch = tokenizer(texts, return_tensors='pt', padding=True)
        seq_embs = encoder(**batch).last_hidden_state
        mask = batch.attention_mask
        # Average token embeddings over real tokens only; padding is masked out.
        mean_emb = (seq_embs * mask.unsqueeze(-1)).sum(1) / mask.unsqueeze(-1).sum(1)
        if norm:
            # Optionally L2-normalize each sentence embedding.
            mean_emb = torch.nn.functional.normalize(mean_emb)
    return mean_emb


sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embs = encode_mean_pool(sentences, tokenizer, encoder, lang="eng_Latn")
print(embs.shape)
# torch.Size([2, 1024])
print(embs)
# tensor([[-0.0053, 0.0020, -0.0006, ..., 0.0094, -0.0009, 0.0070],
# [-0.0003, -0.0071, 0.0076, ..., 0.0055, 0.0022, -0.0083]])
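To make the pooling step in encode_mean_pool easier to follow, here is a minimal pure-Python sketch of the same idea on a toy input: sum the token vectors where the attention mask is 1, divide by the number of unmasked tokens, and optionally L2-normalize the result. The helper names masked_mean_pool, l2_normalize, and cosine_similarity are illustrative only and are not part of this project or of transformers. Comparing embeddings by cosine similarity is a common choice and is assumed here, not something the model card prescribes.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy sequence: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is deliberately large to show that it has no effect on the result. With real embeddings, the rows of the tensor returned by encode_mean_pool could be converted with .tolist() and passed to cosine_similarity in the same way.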
Advanced usage
For advanced usage examples, see the README at https://github.com/facebookresearch/SONAR.
📄 License
This project is licensed under CC-BY-NC-4.0.
Language Support
The encoder supports the following languages: ace, acm, acq, aeb, af, ajp, ak, am, apc, ar, ars, ary, arz, as, ast, awa, ay, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fa, fj, fi, fon, fr, fur, ff, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kr, kk, kbp, kea, km, ki, rw, ky, kmb, kg, ko, kmr, lo, lv, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, mag, mai, ml, mr, min, mk, plt, mt, mni, mn, mos, mi, ms, my, nl, nn, nb, ne, nso, nus, ny, oc, gaz, ory, pag, pa, pap, pl, pt, prs, pbt, qu, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, als, sc, sr, ss, su, sv, sw, szl, ta, tt, te, tg, tl, th, ti, taq, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uz, vec, vi, war, wo, xh, yi, yo, yue, zh, zu
The detailed language codes, each combining a language and a script, are as follows: ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
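Codes of this form, such as eng_Latn, are the values the basic usage example passes as the lang argument. As a small sketch, assuming every code follows the three-letter language plus four-letter script pattern seen in the list, the helpers below split a code into its parts and group scripts by language. The names split_flores_code, scripts_by_language, and SAMPLE_CODES are hypothetical, and the sample is a hand-picked subset of the list, not the full set.

```python
def split_flores_code(code):
    """Split a code such as 'eng_Latn' into (language, script)."""
    language, sep, script = code.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"not a language_Script style code: {code!r}")
    return language, script


# Hand-picked subset of the listed codes, for illustration only.
SAMPLE_CODES = ["ace_Arab", "ace_Latn", "eng_Latn", "rus_Cyrl", "zho_Hans", "zho_Hant"]


def scripts_by_language(codes):
    """Map each language to the scripts it appears with, in input order."""
    grouped = {}
    for code in codes:
        language, script = split_flores_code(code)
        grouped.setdefault(language, []).append(script)
    return grouped


grouped = scripts_by_language(SAMPLE_CODES)
print(grouped["zho"])  # ['Hans', 'Hant']
```

A check like this can catch a mistyped code (for example en_Latn) before it is assigned to tokenizer.src_lang. It validates only the shape of the code, not whether the language is actually among the supported ones.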
Model Packaging
The model was repackaged in this notebook.







