🚀 AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
AfroLM is a multilingual language model pretrained from scratch with a self-active learning framework, covering 23 African languages. It delivers strong performance on downstream evaluations while being very data-efficient.
This repository contains the model for our paper AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages,
which appears in the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) at EMNLP 2022. You can access the GitHub repository of the paper.
🚀 Quick Start
Our self-active learning framework
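In broad strokes, the framework interleaves pretraining rounds with a data-selection step in which the model itself scores candidate sentences and the most informative ones are moved into the training pool for the next round. The snippet below is only a simplified illustration of that control flow, not the training code from the repository (`active_learning.py`); the callables, round count, and selection budget are placeholders.

```python
# Simplified illustration of a self-active learning loop (not the actual AfroLM
# training code): train on the current pool, let the model score the remaining
# candidates, and pull the highest-scoring ones into the next round's pool.
def self_active_learning(train_fn, score_fn, train_pool, candidate_pool,
                         rounds=5, budget=10_000):
    for _ in range(rounds):
        train_fn(train_pool)                                          # e.g., one MLM pretraining pass
        ranked = sorted(candidate_pool, key=score_fn, reverse=True)   # most informative first
        train_pool += ranked[:budget]                                 # grow the training pool
        candidate_pool = ranked[budget:]
    return train_pool

# Toy usage with dummy callables, just to show the control flow:
print(self_active_learning(
    train_fn=lambda pool: None,   # stand-in for a pretraining step
    score_fn=len,                 # stand-in for a per-sentence informativeness score
    train_pool=["seed sentence"],
    candidate_pool=["short", "a much longer candidate sentence"],
    rounds=1,
    budget=1,
))
```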

Languages Covered
AfroLM has been pretrained from scratch on 23 African Languages: Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu.
Evaluation Results
AfroLM was evaluated on the MasakhaNER1.0 (10 African languages) and MasakhaNER2.0 (21 African languages) datasets, as well as on text classification and sentiment analysis. AfroLM outperformed AfriBERTa, mBERT, and XLMR-base, and was very competitive with AfroXLMR. AfroLM is also very data-efficient, as it was pretrained on a dataset 14x+ smaller than those of its competitors. Below are the average F1-score performances of the various models across the datasets; please consult our paper for language-level results.
| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|---|---|---|---|---|---|
| AfroLM-Large | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** |
| AfriBERTa | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| mBERT | 71.55 | 80.68 | --- | --- | --- |
| XLMR-base | 79.16 | 83.09 | --- | --- | --- |
| AfroXLMR-base | 81.90 | 84.55 | --- | --- | --- |
- (*) Evaluation was made on the 11 additional languages of the dataset.
- Bold numbers show the performance of the model pretrained on the smallest amount of data (AfroLM).
Pretrained Models and Dataset
Model: AfroLM-Large and Dataset: AfroLM Dataset
💻 Usage Examples
Basic Usage
```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
```
The `AutoTokenizer` class does not load our tokenizer correctly, so we recommend using the `XLMRobertaTokenizer` class directly. Depending on your task, load the corresponding head of the model; see the [XLM-RoBERTa documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta).
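For example, since AfroLM was pretrained with masked language modeling, a quick way to sanity-check the checkpoint is the standard `fill-mask` pipeline. This is only a minimal sketch; the example sentence is arbitrary and the pipeline defaults are assumed.

```python
# Minimal fill-mask sanity check for AfroLM (sketch; the example sentence is arbitrary).
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer, pipeline

tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
model = XLMRobertaForMaskedLM.from_pretrained("bonadossou/afrolm_active_learning")

# The pipeline returns the top candidate tokens for the <mask> position.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"I love {tokenizer.mask_token}."))
```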
Reproducing our results: Training and Evaluation
- To train the network, run `python active_learning.py`. You can also wrap it in a bash script.
- For the evaluation:
  - NER classification: `bash ner_experiments.sh` (see the fine-tuning sketch below)
  - Text classification & sentiment analysis: `bash text_classification_all.sh`
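If you prefer to fine-tune AfroLM in Python rather than through the scripts above, the sketch below shows one way to attach a token-classification head for NER. It is not the code used in `ner_experiments.sh`; the label list is the standard MasakhaNER tag inventory and is an assumption here, so adjust it to the split you actually use.

```python
# Sketch of loading AfroLM with a token-classification head for NER fine-tuning.
# NOTE: the label list below (standard MasakhaNER tags) is an assumption; adapt it
# to your dataset, then train with your preferred loop (e.g., transformers.Trainer).
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizer

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
model = XLMRobertaForTokenClassification.from_pretrained(
    "bonadossou/afrolm_active_learning",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized and must be fine-tuned before use.
```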
📄 License
The project is licensed under the CC-BY-4.0 license.
📚 Documentation
Annotations and Languages
| Property | Details |
|---|---|
| Annotations Creators | crowdsourced |
| Language | amh, orm, lin, hau, ibo, kin, lug, luo, pcm, swa, wol, yor, bam, bbj, ewe, fon, mos, nya, sna, tsn, twi, xho, zul |
| Language Creators | crowdsourced |
| Multilinguality | monolingual |
| Pretty Name | afrolm-dataset |
| Size Categories | 1M < n < 10M |
| Source Datasets | original |
| Tags | afrolm, active learning, language modeling, research papers, natural language processing, self-active learning |
| Task Categories | fill-mask |
| Task IDs | masked-language-modeling |
Citation
```bibtex
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P. and
      Tonja, Atnafu Lambebo and
      Yousuf, Oreen and
      Osei, Salomey and
      Oppong, Abigail and
      Shode, Iyanuoluwa and
      Awoyomi, Oluwabusayo Olufunke and
      Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64"
}
```
The entry above is the official proceedings citation. If you have liked our work, please give it a star.
Reach out
Do you have a question? Please create an issue, and we will respond as soon as possible.