🚀 SMaLL-100 Model
SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10,000 language pairs. It achieves results competitive with M2M-100 while being considerably smaller and faster. The model was introduced in this paper (accepted at EMNLP 2022) and originally released in this repository.
🚀 Quick Start
SMaLL-100 uses the same model architecture and configuration as the M2M-100 implementation, but its tokenizer is modified to adjust the language codes. For now, you should therefore load the tokenizer locally from the tokenization_small100.py file.
Demo: https://huggingface.co/spaces/alirezamsh/small100
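To load the tokenizer locally as described above, the tokenization_small100.py file needs to sit next to your script. Below is a minimal sketch for fetching it with huggingface_hub, assuming the file is hosted in the alirezamsh/small100 model repository:
from huggingface_hub import hf_hub_download
# Download the custom tokenizer implementation into the current directory
hf_hub_download(repo_id="alirezamsh/small100", filename="tokenization_small100.py", local_dir=".")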
⚠️ Important Note
SMALL100Tokenizer requires sentencepiece, so make sure it is installed:
pip install sentencepiece
✨ Key Features
- Compact and fast: small model size with fast inference.
- Multilingual: covers more than 10,000 language pairs.
- Competitive quality: performance on par with M2M-100.
📦 Installation
Install sentencepiece:
pip install sentencepiece
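The usage examples below also assume transformers and PyTorch are available; if they are not already installed, a minimal environment setup (versions unspecified here) is:
pip install transformers torch sentencepiece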
💻 Usage Examples
Basic Usage
Supervised training example
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
src_text = "Life is like a box of chocolates."
tgt_text = "La vie est comme une boîte de chocolat."
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
loss = model(**model_inputs).loss # forward pass
Training data can be provided upon request.
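Building on the forward pass above, the following is a minimal fine-tuning step sketch using a standard PyTorch optimizer (AdamW with an illustrative learning rate); it is not the authors' original training recipe:
import torch
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative hyperparameters
model.train()
model_inputs = tokenizer("Life is like a box of chocolates.", text_target="La vie est comme une boîte de chocolat.", return_tensors="pt")
loss = model(**model_inputs).loss  # forward pass
loss.backward()                    # backpropagate
optimizer.step()                   # update parameters
optimizer.zero_grad()              # reset gradients for the next step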
Generation example
Generation uses a beam size of 5 and a maximum target length of 256.
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer
hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")
# Hindi to French translation
tokenizer.tgt_lang = "fr"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
# Chinese to English translation
tokenizer.tgt_lang = "en"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."
📚 Documentation
Evaluation
Please refer to the original repository for spBLEU computation.
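As a rough pointer only (the original repository remains the authoritative reference), spBLEU can be computed with sacrebleu's sentencepiece-based FLORES-101 tokenizer; the sketch below assumes a recent sacrebleu release in which the "flores101" tokenizer is available:
import sacrebleu
hypotheses = ["..."]   # system translations (placeholders)
references = ["..."]   # reference translations (placeholders)
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="flores101")
print(bleu.score)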
Supported Languages
The model supports the following languages: Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Khmer (km), Kannada (kn), Korean (ko), Luxembourgish (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (oc), Oriya (or), Punjabi (pa), Polish (pl), Pashto (ps), Portuguese (pt), Romanian; Moldavian (ro), Russian (ru), Sindhi (sd), Sinhala (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
📄 License
This project is released under the MIT license.
📖 Citation
If you use this model in your research, please cite the following works:
@inproceedings{mohammadshahi-etal-2022-small,
title = "{SM}a{LL}-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages",
author = "Mohammadshahi, Alireza and
Nikoulina, Vassilina and
Berard, Alexandre and
Brun, Caroline and
Henderson, James and
Besacier, Laurent",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.571",
pages = "8348--8359",
abstract = "In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the {``}curse of multilinguality{''}, these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100(12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19 and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.",
}
@inproceedings{mohammadshahi-etal-2022-compressed,
title = "What Do Compressed Multilingual Machine Translation Models Forget?",
author = "Mohammadshahi, Alireza and
Nikoulina, Vassilina and
Berard, Alexandre and
Brun, Caroline and
Henderson, James and
Besacier, Laurent",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.317",
pages = "4308--4329",
abstract = "Recently, very large pre-trained models achieve state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques allow to drastically reduce the size of the models and therefore their inference time with negligible impact on top-tier metrics. However, the general performance averaged across multiple tasks and/or languages may hide a drastic performance drop on under-represented features, which could result in the amplification of biases encoded by the models. In this work, we assess the impact of compression methods on Multilingual Neural Machine Translation models (MNMT) for various language groups, gender, and semantic biases by extensive analysis of compressed models on different machine translation benchmarks, i.e. FLORES-101, MT-Gender, and DiBiMT. We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. Interestingly, the removal of noisy memorization with compression leads to a significant improvement for some medium-resource languages. Finally, we demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.",
}



