🚀 afriberta_large
AfriBERTa large is a pretrained multilingual language model. It covers multiple African languages and achieves competitive performance on downstream tasks such as text classification and named entity recognition.
🚀 Quick Start
You can use this model with Transformers for any downstream task.
For example, to fine-tune it on a token classification task, load the model and tokenizer as follows:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
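>>> # note: the max length is set manually below; the original model card indicates the imported SentencePiece tokenizer does not carry this setting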
>>> tokenizer.model_max_length = 512
✨ Features
- Multilingual Support: AfriBERTa large is a pretrained multilingual language model with approximately 126 million parameters. It was pretrained on 11 African languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
- Strong Downstream Performance: The model obtains competitive downstream performance on text classification and named entity recognition across several African languages, including some it was not pretrained on.
📦 Installation
No model-specific installation steps are required; the model is loaded directly through the Hugging Face Transformers library.
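If Transformers is not already available, a typical setup (an assumption about your environment, not part of the original instructions) is `pip install transformers` plus a backend such as PyTorch.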
💻 Usage Examples
Basic Usage
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
>>> tokenizer.model_max_length = 512
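Beyond loading the model, the short sketch below runs a single forward pass to obtain per-token logits. This is an illustrative assumption about a typical workflow (PyTorch backend, placeholder input sentence), not part of the original instructions; the token classification head is randomly initialized until the model is fine-tuned.
>>> import torch
>>> inputs = tokenizer("Habari ya asubuhi", return_tensors="pt")  # placeholder Swahili sentence
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.logits.shape  # (batch size, sequence length, number of labels)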
📚 Documentation
Model description
AfriBERTa large is a pretrained multilingual language model with around 126 million parameters.
The model has 10 layers, 6 attention heads, a hidden size of 768 and a feed-forward size of 3072.
It was pretrained on 11 African languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
The model has been shown to obtain competitive downstream performance on text classification and named entity recognition across several African languages, including those it was not pretrained on.
Intended uses & limitations
How to use
You can use this model with Transformers for any downstream task.
Limitations and bias
- This model may be limited by its training data, which was obtained mainly from news articles covering a specific span of time; it therefore may not generalize well to other domains or periods.
- The model was trained on very little data (less than 1 GB), so it may not have seen enough data to learn very complex linguistic relations.
Training data
The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.
Training procedure
For information on the training procedure, please refer to the AfriBERTa paper or repository.
🔧 Technical Details
The model has 10 layers, 6 attention heads, a hidden size of 768 and a feed-forward size of 3072.
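As a quick check of these dimensions, the configuration can be inspected directly. This is a minimal sketch assuming the checkpoint exposes the standard Transformers config fields:
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("castorini/afriberta_large")
>>> config.num_hidden_layers, config.num_attention_heads, config.hidden_size, config.intermediate_size
>>> # expected (per the description above): 10 layers, 6 heads, hidden size 768, feed-forward size 3072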
📄 License
This project is licensed under the MIT license.
BibTeX entry and citation info
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}
Information Table
| Property | Details |
|----------|---------|
| Model Type | Pretrained multilingual language model |
| Training Data | Aggregation of datasets from the BBC news website and Common Crawl |
| Supported Languages | Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, Yorùbá |
| License | MIT |
| Datasets | castorini/afriberta-corpus |