AfriBERTa Small: Open-Source Multilingual Model for Text Classification and Named Entity Recognition in 11 African Languages

AfriBERTa Small

Developed by castorini

AfriBERTa Small is a 97-million-parameter multilingual pretrained model supporting 11 African languages, suitable for tasks like text classification and named entity recognition.

Large Language Model

Transformers

#African multilingual processing #Low-resource optimization #News text analysis

Release Time: 3/2/2022

Model Overview

This is a multilingual model pretrained on African languages. It targets low-resource settings and is suited to NLP tasks involving African languages.

Model Features

Multilingual support

Pretrained on 11 African languages, including low-resource languages such as Oromo and Amharic

Lightweight design

Compact model with only 97 million parameters, suitable for deployment in resource-constrained environments

Cross-lingual generalization

Shows competitive performance even on some African languages not included in pretraining

Model Capabilities

Text classification

Named entity recognition

Multilingual text processing

Use Cases

News analysis

African news classification

Classifying multilingual news content from Africa

Performs well on BBC News data

Language processing

Low-resource language NER

Named entity recognition for low-resource African languages

Reported as competitive with similar models, including on languages not seen during pretraining

🚀 afriberta_small

AfriBERTa small is a pretrained multilingual language model that can achieve competitive performance on downstream tasks in multiple African languages.

🚀 Quick Start

You can use this model with Transformers for any downstream task. For example, to fine-tune it on a token classification task, load the model and tokenizer as follows:

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_small")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_small")
>>> # The max length must be set manually: the tokenizer is an imported, pretrained
>>> # SentencePiece model, which Hugging Face does not fully support right now.
>>> tokenizer.model_max_length = 512
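
Fine-tuning for token classification usually also requires spreading word-level labels over the subword tokens the tokenizer produces. The sketch below shows one common convention, not necessarily the one the AfriBERTa authors used: the first subword of each word keeps the label, and every other position gets -100, which PyTorch's cross-entropy loss ignores by default. The helper name `align_labels` and the example inputs are hypothetical; the `word_ids` list is assumed to come from a fast tokenizer's `word_ids()` output.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword tokens.

    The first subword of each word keeps the word's label. Special tokens
    (word id None) and continuation subwords get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into two subwords,
# wrapped in start and end special tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [1, 0, 2]
print(align_labels(example_word_ids, example_labels))
# [-100, 1, 0, -100, 2, -100]
```

The aligned list has one entry per token, so it can be passed as the `labels` field alongside the tokenized input during training.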

✨ Features

Multilingual Support: AfriBERTa small supports 11 African languages, including Afaan Oromoo, Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, and Yorùbá.
Competitive Performance: The model has shown competitive downstream performance on text classification and named entity recognition in several African languages, even on some it was not pretrained on.

📦 Installation

No specific installation steps are provided in the original README.

📚 Documentation

Model description

AfriBERTa small is a pretrained multilingual language model with around 97 million parameters. The model has 4 layers, 6 attention heads, 768 hidden units and a feed-forward size of 3072. It was pretrained on 11 African languages, namely Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
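
To give a feel for these dimensions, here is a rough back-of-the-envelope count of the encoder stack alone. It assumes a standard BERT/RoBERTa-style layer (four attention projections with biases, a two-layer feed-forward block, two layer norms), which the description above does not spell out, and it deliberately leaves out the embeddings and any task head.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Parameter count of one assumed BERT-style Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # scale and shift for two layer norms
    return attention + feed_forward + layer_norms


# Dimensions taken from the model description.
HIDDEN, FFN, LAYERS, HEADS = 768, 3072, 4, 6

per_layer = encoder_layer_params(HIDDEN, FFN)
encoder_total = LAYERS * per_layer
head_dim = HIDDEN // HEADS

print(per_layer)      # 7087872
print(encoder_total)  # 28351488
print(head_dim)       # 128
```

Under these assumptions the four layers hold roughly 28 million weights, which suggests that a large share of the stated 97 million sits outside the encoder stack, most plausibly in the embedding matrix.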

Intended uses & limitations

How to use

You can use this model with Transformers for any downstream task.
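
For text classification, a loading routine might look like the sketch below. The label names, `build_label_maps`, and `load_classifier` are hypothetical and only for illustration. The Transformers import sits inside the function so that the label helper runs without the library installed. The classification head created this way is randomly initialized, so the model still needs fine-tuning on labeled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "castorini/afriberta_small"


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build the id-to-label and label-to-id mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(labels: List[str]):
    """Load tokenizer and model with a fresh classification head (downloads weights)."""
    # Lazy import: only needed when the model is actually loaded.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.model_max_length = 512  # same manual setting as in the Quick Start
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical news categories, not taken from the model card.
id2label, label2id = build_label_maps(["business", "politics", "sport"])
print(id2label)  # {0: 'business', 1: 'politics', 2: 'sport'}
```

Calling `load_classifier(["business", "politics", "sport"])` would then return a tokenizer and model ready for fine-tuning.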

Limitations and bias

This model is possibly limited by its training dataset, which was obtained mostly from news articles from a specific span of time. Thus, it may not generalize well.
This model was trained on very little data (less than 1 GB), so it may not have seen enough data to learn very complex linguistic relations.

Training data

The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.

Training procedure

For information on training procedures, please refer to the AfriBERTa paper or repository.

BibTeX entry and citation info

@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi  and
      Zhu, Yuxin  and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}

📄 License

No license information is provided in the original README.
