🚀 SERENGETI
SERENGETI is a set of massively multilingual language models that cover 517 African languages and language varieties, outperforming other models on multiple natural language understanding tasks.
🚀 Quick Start
Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We developed SERENGETI to address this limitation.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing them to 4 mPLMs that cover 4-23 African languages. **SERENGETI** outperforms the other models on 11 datasets across the eight tasks, achieving an average F1 score of 82.27. We also perform error analyses to investigate the influence of language genealogy and linguistic similarity under zero-shot settings. We will publicly release our models for research.
Further details about the model are available in the [paper](https://aclanthology.org/2023.findings-acl.97/).
Model Information
| Property | Details |
|----------|---------|
| Pipeline Tag | fill-mask |
| Languages | aa, af, am, ak, bm, ff, fon, ha, ig, ki, lg, ln, mg, nr, om, rn, run, sw, sn, tn, ti, ve, wo, xh, yo, zu |
| Tags | Masked Language Model |
| Widget Examples | 1. ẹ jọwọ , ẹ <mask> mi. 2. gbọ́ láìfọ̀rọ̀ gùn rárá. |
📦 Installation
SERENGETI does not require any package-specific installation; the models are loaded directly through the Hugging Face `transformers` library (see Usage Examples below).
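A minimal sketch of an environment check, assuming the standard Hugging Face stack (`transformers` plus a PyTorch backend, e.g. installed with `pip install transformers torch`):
# Verify that the required libraries are importable before loading SERENGETI
import transformers
import torch
print("transformers", transformers.__version__)
print("torch", torch.__version__)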
💻 Usage Examples
Basic Usage
Below is an example of using SERENGETI to predict masked tokens.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the SERENGETI tokenizer and masked language model from the Hugging Face Hub
# (replace "XXX" with your access token)
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/serengeti-E250", use_auth_token="XXX")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/serengeti-E250", use_auth_token="XXX")

# Build a fill-mask pipeline and predict candidates for the masked token
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
classifier("ẹ jọwọ , ẹ <mask> mi")
[{'score': 0.07887924462556839,
'token': 8418,
'token_str': 'ọmọ',
'sequence': 'ẹ jọwọ, ẹ ọmọ mi'},
{'score': 0.04658124968409538,
'token': 156595,
'token_str': 'fẹ́ràn',
'sequence': 'ẹ jọwọ, ẹ fẹ́ràn mi'},
{'score': 0.029315846040844917,
'token': 204050,
'token_str': 'gbàgbé',
'sequence': 'ẹ jọwọ, ẹ gbàgbé mi'},
{'score': 0.02790883742272854,
'token': 10730,
'token_str': 'kọ',
'sequence': 'ẹ jọwọ, ẹ kọ mi'},
{'score': 0.022904086858034134,
'token': 115382,
'token_str': 'bẹ̀rù',
'sequence': 'ẹ jọwọ, ẹ bẹ̀rù mi'}]
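The pipeline returns candidates sorted by score, so the best completion can be read off the first entry. A minimal sketch, assuming the `classifier` pipeline created above, that keeps only the three highest-scoring predictions:
# Limit the fill-mask pipeline to the three most likely candidates
predictions = classifier("ẹ jọwọ , ẹ <mask> mi", top_k=3)

# Each prediction carries the score, token id, token string, and completed sequence
for p in predictions:
    print(f"{p['token_str']}\t{p['score']:.4f}\t{p['sequence']}")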
For more details, please read this notebook.
📚 Documentation
Ethics
Serengeti aligns with Afrocentric NLP, which centers the needs of African communities when developing technology. We believe it will be useful for both language speakers and researchers. Here are some use cases and impacts:
- Serengeti aims to address the lack of language technology for about 90% of the world's languages, with a focus on Africa. It is the first massively multilingual PLM for African languages and varieties, covering 517 languages, the largest number to date for African NLP.
- It enables better access to important information in Indigenous African languages for the African community, benefiting people who are not fluent in other languages and potentially connecting more people globally.
- Serengeti offers opportunities for the preservation of many African languages. It includes languages that have not been used in NLP tasks until now and can encourage their continued use and future language technology development.
- To mitigate discrimination and bias, we manually curated our datasets. Native speakers of several of the languages evaluated a subset of the data. The data come from a variety of domains to better represent how the languages are used by native speakers.
- Although LMs are useful, they can be misused. Serengeti is developed using publicly available datasets that may contain biases. Our analyses are not comprehensive, and we lack access to native speakers of most of the covered languages, which hinders our ability to investigate potential biases.
Supported Languages
Please refer to [supported-languages](./supported-languages.txt)
Citation
If you use the pre-trained model (Serengeti) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@inproceedings{adebara-etal-2023-serengeti,
title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
author = "Adebara, Ife and
Elmadany, AbdelRahim and
Abdul-Mageed, Muhammad and
Alcoba Inciarte, Alcides",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.97",
doi = "10.18653/v1/2023.findings-acl.97",
pages = "1498--1537",
}
Acknowledgments
We gratefully acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), Digital Research Alliance of Canada, [UBC ARC-Sockeye](https://arc.ubc.ca/ubc-arc-sockeye), Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.