🚀 AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
AfroLM is a multilingual language model pretrained from scratch with a self-active learning framework, covering 23 African languages. It delivers strong performance on downstream evaluations while being very data-efficient.
This repository contains the model for our paper AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages,
which appears in the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) at EMNLP 2022. You can access the GitHub repository of the paper.
🚀 Quick Start
Our self-active learning framework
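In broad strokes, the framework interleaves pretraining rounds with a data-selection step in which the model itself scores candidate sentences and the most informative ones are moved into the training pool for the next round. The snippet below is only a simplified illustration of that control flow, not the training code from the repository (`active_learning.py`); the callables, round count, and selection budget are placeholders.

```python
# Simplified illustration of a self-active learning loop (not the actual AfroLM
# training code): train on the current pool, let the model score the remaining
# candidates, and pull the highest-scoring ones into the next round's pool.
def self_active_learning(train_fn, score_fn, train_pool, candidate_pool,
                         rounds=5, budget=10_000):
    for _ in range(rounds):
        train_fn(train_pool)                                          # e.g., one MLM pretraining pass
        ranked = sorted(candidate_pool, key=score_fn, reverse=True)   # most informative first
        train_pool += ranked[:budget]                                 # grow the training pool
        candidate_pool = ranked[budget:]
    return train_pool

# Toy usage with dummy callables, just to show the control flow:
print(self_active_learning(
    train_fn=lambda pool: None,   # stand-in for a pretraining step
    score_fn=len,                 # stand-in for a per-sentence informativeness score
    train_pool=["seed sentence"],
    candidate_pool=["short", "a much longer candidate sentence"],
    rounds=1,
    budget=1,
))
```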

Languages Covered
AfroLM has been pretrained from scratch on 23 African Languages: Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu.
Evaluation Results
AfroLM was evaluated on the MasakhaNER1.0 (10 African languages) and MasakhaNER2.0 (21 African languages) datasets, as well as on text classification and sentiment analysis. AfroLM outperformed AfriBERTa, mBERT, and XLMR-base, and was very competitive with AfroXLMR. AfroLM is also very data-efficient, as it was pretrained on a dataset 14x+ smaller than those of its competitors. Below are the average F1-score performances of the various models across the datasets; please consult our paper for language-level results.
| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|---|---|---|---|---|---|
| AfroLM-Large | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** |
| AfriBERTa | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| mBERT | 71.55 | 80.68 | --- | --- | --- |
| XLMR-base | 79.16 | 83.09 | --- | --- | --- |
| AfroXLMR-base | 81.90 | 84.55 | --- | --- | --- |
- (*) Evaluation was made on the 11 additional languages of the dataset.
- Bold numbers show the performance of the model pretrained on the smallest amount of data (AfroLM).
Pretrained Models and Dataset
Model: AfroLM-Large and Dataset: AfroLM Dataset
💻 Usage Examples
Basic Usage
```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
```
The `AutoTokenizer` class does not load our tokenizer correctly, so we recommend using the `XLMRobertaTokenizer` class directly. Depending on your task, load the corresponding head of the model; see the [XLM-RoBERTa documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta).
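For example, since AfroLM was pretrained with masked language modeling, a quick way to sanity-check the checkpoint is the standard `fill-mask` pipeline. This is only a minimal sketch; the example sentence is arbitrary and the pipeline defaults are assumed.

```python
# Minimal fill-mask sanity check for AfroLM (sketch; the example sentence is arbitrary).
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer, pipeline

tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
model = XLMRobertaForMaskedLM.from_pretrained("bonadossou/afrolm_active_learning")

# The pipeline returns the top candidate tokens for the <mask> position.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"I love {tokenizer.mask_token}."))
```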
Reproducing our results: Training and Evaluation
- To train the network, run `python active_learning.py`. You can also wrap it in a bash script.
- For the evaluation:
  - NER classification: `bash ner_experiments.sh` (see the fine-tuning sketch below)
  - Text classification & sentiment analysis: `bash text_classification_all.sh`
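If you prefer to fine-tune AfroLM in Python rather than through the scripts above, the sketch below shows one way to attach a token-classification head for NER. It is not the code used in `ner_experiments.sh`; the label list is the standard MasakhaNER tag inventory and is an assumption here, so adjust it to the split you actually use.

```python
# Sketch of loading AfroLM with a token-classification head for NER fine-tuning.
# NOTE: the label list below (standard MasakhaNER tags) is an assumption; adapt it
# to your dataset, then train with your preferred loop (e.g., transformers.Trainer).
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizer

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
model = XLMRobertaForTokenClassification.from_pretrained(
    "bonadossou/afrolm_active_learning",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized and must be fine-tuned before use.
```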
📄 License
The project is licensed under the CC-BY-4.0 license.
📚 Documentation
Annotations and Languages
| Property | Details |
|---|---|
| Annotations Creators | crowdsourced |
| Language | amh, orm, lin, hau, ibo, kin, lug, luo, pcm, swa, wol, yor, bam, bbj, ewe, fon, mos, nya, sna, tsn, twi, xho, zul |
| Language Creators | crowdsourced |
| Multilinguality | monolingual |
| Pretty Name | afrolm-dataset |
| Size Categories | 1M < n < 10M |
| Source Datasets | original |
| Tags | afrolm, active learning, language modeling, research papers, natural language processing, self-active learning |
| Task Categories | fill-mask |
| Task IDs | masked-language-modeling |
Citation
```bibtex
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P. and
      Tonja, Atnafu Lambebo and
      Yousuf, Oreen and
      Osei, Salomey and
      Oppong, Abigail and
      Shode, Iyanuoluwa and
      Awoyomi, Oluwabusayo Olufunke and
      Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64"
}
```
The entry above is the official proceedings citation. If you have liked our work, please give it a star.
Reach out
Do you have a question? Please create an issue, and we will respond as soon as possible.