🚀 afriberta_large
AfriBERTa large is a pretrained multilingual language model. It covers multiple African languages and achieves competitive performance on downstream tasks such as text classification and named entity recognition.
🚀 Quick Start
You can use this model with Transformers for any downstream task.
For example, to fine-tune it on a token classification task, load the model and tokenizer as follows:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
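>>> # note: the max length is set manually below; the original model card indicates the imported SentencePiece tokenizer does not carry this setting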
>>> tokenizer.model_max_length = 512
✨ Features
- Multilingual Support: AfriBERTa large is a pretrained multilingual language model with approximately 126 million parameters. It was pretrained on 11 African languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
- Strong Downstream Performance: The model obtains competitive downstream performance on text classification and named entity recognition across several African languages, including some it was not pretrained on.
📦 Installation
No model-specific installation steps are required; the model is loaded directly through the Hugging Face Transformers library.
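If Transformers is not already available, a typical setup (an assumption about your environment, not part of the original instructions) is `pip install transformers` plus a backend such as PyTorch.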
💻 Usage Examples
Basic Usage
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
>>> tokenizer.model_max_length = 512
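Beyond loading the model, the short sketch below runs a single forward pass to obtain per-token logits. This is an illustrative assumption about a typical workflow (PyTorch backend, placeholder input sentence), not part of the original instructions; the token classification head is randomly initialized until the model is fine-tuned.
>>> import torch
>>> inputs = tokenizer("Habari ya asubuhi", return_tensors="pt")  # placeholder Swahili sentence
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.logits.shape  # (batch size, sequence length, number of labels)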
📚 Documentation
Model description
AfriBERTa large is a pretrained multilingual language model with around 126 million parameters.
The model has 10 layers, 6 attention heads, a hidden size of 768 and a feed-forward size of 3072.
It was pretrained on 11 African languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
The model has been shown to obtain competitive downstream performance on text classification and named entity recognition across several African languages, including those it was not pretrained on.
Intended uses & limitations
How to use
You can use this model with Transformers for any downstream task.
Limitations and bias
- This model may be limited by its training data, which was obtained mainly from news articles covering a specific span of time; it therefore may not generalize well to other domains or periods.
- The model was trained on very little data (less than 1 GB), so it may not have seen enough data to learn very complex linguistic relations.
Training data
The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.
Training procedure
For information on the training procedure, please refer to the AfriBERTa paper or repository.
🔧 Technical Details
The model has 10 layers, 6 attention heads, a hidden size of 768 and a feed-forward size of 3072.
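As a quick check of these dimensions, the configuration can be inspected directly. This is a minimal sketch assuming the checkpoint exposes the standard Transformers config fields:
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("castorini/afriberta_large")
>>> config.num_hidden_layers, config.num_attention_heads, config.hidden_size, config.intermediate_size
>>> # expected (per the description above): 10 layers, 6 heads, hidden size 768, feed-forward size 3072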
📄 License
This project is licensed under the MIT license.
BibTeX entry and citation info
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}
Information Table
| Property | Details |
|----------|---------|
| Model Type | Pretrained multilingual language model |
| Training Data | Aggregation of datasets from the BBC news website and Common Crawl |
| Supported Languages | Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, Yorùbá |
| License | MIT |
| Datasets | castorini/afriberta-corpus |