🚀 SERENGETI
SERENGETI is a set of massively multilingual language models that cover 517 African languages and language varieties, outperforming other models on multiple natural language understanding tasks.
🚀 Quick Start
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We developed SERENGETI to address this limitation.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing them to 4 mPLMs that cover 4-23 African languages. **SERENGETI** outperforms the other models on 11 datasets across the eight tasks, achieving an average F1 score of 82.27. We also perform error analyses to investigate the influence of language genealogy and linguistic similarity under zero-shot settings. We will publicly release our models for research.
Further details about the model are available in the [paper](https://aclanthology.org/2023.findings-acl.97/).
Model Information
| Property | Details |
|----------|---------|
| Pipeline Tag | fill-mask |
| Languages | aa, af, am, ak, bm, ff, fon, ha, ig, ki, lg, ln, mg, nr, om, rn, run, sw, sn, tn, ti, ve, wo, xh, yo, zu |
| Tags | Masked Language Model |
| Widget Examples | 1. ẹ jọwọ , ẹ <mask> mi. 2. gbọ́ láìfọ̀rọ̀ gùn rárá. |
📦 Installation
SERENGETI does not require any package-specific installation; the models are loaded directly through the Hugging Face `transformers` library (see Usage Examples below).
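A minimal sketch of an environment check, assuming the standard Hugging Face stack (`transformers` plus a PyTorch backend, e.g. installed with `pip install transformers torch`):
# Verify that the required libraries are importable before loading SERENGETI
import transformers
import torch
print("transformers", transformers.__version__)
print("torch", torch.__version__)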
💻 Usage Examples
Basic Usage
Below is an example of using SERENGETI to predict masked tokens.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the SERENGETI tokenizer and masked language model from the Hugging Face Hub
# (replace "XXX" with your access token)
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti-E250", use_auth_token="XXX")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti-E250", use_auth_token="XXX")

# Build a fill-mask pipeline and predict candidates for the masked token
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi")
[{'score': 0.07887924462556839,
'token': 8418,
'token_str': 'ọmọ',
'sequence': 'ẹ jọwọ, ẹ ọmọ mi'},
{'score': 0.04658124968409538,
'token': 156595,
'token_str': 'fẹ́ràn',
'sequence': 'ẹ jọwọ, ẹ fẹ́ràn mi'},
{'score': 0.029315846040844917,
'token': 204050,
'token_str': 'gbàgbé',
'sequence': 'ẹ jọwọ, ẹ gbàgbé mi'},
{'score': 0.02790883742272854,
'token': 10730,
'token_str': 'kọ',
'sequence': 'ẹ jọwọ, ẹ kọ mi'},
{'score': 0.022904086858034134,
'token': 115382,
'token_str': 'bẹ̀rù',
'sequence': 'ẹ jọwọ, ẹ bẹ̀rù mi'}]
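The pipeline returns candidates sorted by score, so the best completion can be read off the first entry. A minimal sketch, assuming the `classifier` pipeline created above, that keeps only the three highest-scoring predictions:
# Limit the fill-mask pipeline to the three most likely candidates
predictions = classifier("ẹ jọwọ , ẹ <mask> mi", top_k=3)

# Each prediction carries the score, token id, token string, and completed sequence
for p in predictions:
    print(f"{p['token_str']}\t{p['score']:.4f}\t{p['sequence']}")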
For more details, please read this notebook.
📚 Documentation
Ethics
Serengeti aligns with Afrocentric NLP, which centers the needs of African communities when developing technology. We believe it will be useful for both language speakers and researchers. Here are some use cases and impacts:
- Serengeti aims to address the lack of language technology for about 90% of the world's languages, with a focus on Africa. It is the first massively multilingual PLM for African languages and varieties, covering 517 languages, the largest number to date for African NLP.
- It enables better access to important information in Indigenous African languages for the African community, benefiting people who are not fluent in other languages and potentially connecting more people globally.
- Serengeti offers opportunities for the preservation of many African languages. It includes languages that have not been used in NLP tasks until now and can encourage their continued use and future language technology development.
- To mitigate discrimination and bias, we manually curated our datasets. Native speakers of several of the languages evaluated a subset of the data. The data come from a variety of domains to better represent how the languages are used by native speakers.
- Although LMs are useful, they can be misused. Serengeti is developed using publicly available datasets that may contain biases. Our analyses are not comprehensive, and we lack access to native speakers of most of the covered languages, which hinders our ability to investigate potential biases.
Supported Languages
Please refer to [supported-languages](./supported-languages.txt)
Citation
If you use the pre-trained model (Serengeti) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@inproceedings{adebara-etal-2023-serengeti,
title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
author = "Adebara, Ife and
Elmadany, AbdelRahim and
Abdul-Mageed, Muhammad and
Alcoba Inciarte, Alcides",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.97",
doi = "10.18653/v1/2023.findings-acl.97",
pages = "1498--1537",
}
Acknowledgments
We gratefully acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), Digital Research Alliance of Canada, [UBC ARC-Sockeye](https://arc.ubc.ca/ubc-arc-sockeye), Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.