🚀 xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili
This is a token classification model (specifically, named entity recognition) obtained by fine-tuning xlm-roberta-base-finetuned-luganda on the Swahili portion of the MasakhaNER dataset. It provides a ready-made solution for NER on Swahili text.
🚀 Quick Start
To use this model (or any of the related models listed below), change only the model name:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
ner_results = nlp(example)
print(ner_results)
```
✨ Features
- Transformer-based: This model is built on a transformer architecture and fine-tuned on the MasakhaNER dataset, which contains news articles in 10 different African languages.
- Fine-tuning Details: It was fine-tuned for 50 epochs, with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. The process was repeated 5 times with different random seeds, and the uploaded model had the best aggregate F1 on the test set.
- License: The model is licensed under the Apache License, Version 2.0.
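The fine-tuning settings listed above can be expressed as a `transformers` `TrainingArguments` sketch. This is an illustrative reconstruction, not the original training script; the maximum sequence length of 200 is applied at tokenization time rather than here, and `output_dir` is a placeholder.

```python
# Hypothetical reconstruction of the reported hyperparameters using
# transformers.TrainingArguments; the original training script may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ner-swahili",          # placeholder output directory
    num_train_epochs=50,               # 50 epochs, as reported
    per_device_train_batch_size=32,    # batch size of 32
    learning_rate=5e-5,                # learning rate of 5e-5
    seed=1,                            # the original work repeated this over 5 seeds
)
```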
📦 Installation
The original document lists no model-specific installation steps; the model only requires the Hugging Face `transformers` library and a backend such as PyTorch.
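A minimal environment can be set up with pip. This assumes a standard PyTorch backend, which is not stated explicitly in the original card:

```shell
# Assumed standard setup: the transformers library plus a PyTorch backend.
pip install transformers torch
```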
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
ner_results = nlp(example)
print(ner_results)
```
Advanced Usage
The original document provides no advanced examples, but note that the token-level output can be post-processed into whole entities, e.g. via the `aggregation_strategy` parameter of the `transformers` NER pipeline.
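As an illustrative sketch (not from the original card), token-level IOB2 predictions can also be merged into whole entity spans manually. The sample dicts below are hypothetical but mimic the format returned by the `transformers` "ner" pipeline, where "▁" is the SentencePiece word-start marker:

```python
# Illustrative sketch: merging token-level IOB2 predictions into entity
# spans. The input dicts mimic the format returned by the transformers
# "ner" pipeline; the sample values below are hypothetical.

def merge_entities(token_results):
    """Group B-/I- tagged sub-word pieces into (entity_type, text) spans."""
    entities = []
    for tok in token_results:
        tag = tok["entity"]                         # e.g. "B-LOC" or "I-LOC"
        piece = tok["word"]
        sep = " " if piece.startswith("▁") else ""  # "▁" marks a new word
        word = piece.lstrip("▁")
        if tag.startswith("B-") or not entities:
            entities.append([tag[2:], word])        # start a new entity
        else:
            entities[-1][1] += sep + word           # continue the last entity
    return [tuple(e) for e in entities]

sample = [
    {"entity": "B-LOC", "word": "▁Tan"},
    {"entity": "I-LOC", "word": "zania"},
    {"entity": "B-DATE", "word": "▁Jumatatu"},
]
print(merge_entities(sample))  # → [('LOC', 'Tanzania'), ('DATE', 'Jumatatu')]
```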
📚 Documentation
About
This model is transformer-based and fine-tuned on MasakhaNER, a named entity recognition dataset consisting mostly of news articles in 10 different African languages. The model was fine-tuned by Michael Beukman during a project at the University of the Witwatersrand, Johannesburg. This is version 1, as of 20 November 2021.
Contact & More information
For more information about the models, including training scripts, detailed results, and further resources, you can visit the main GitHub repository. You can contact the author by filing an issue on that repository.
Training Resources
Fine-tuning each model on the NER dataset took between 10 and 30 minutes and was performed on an NVIDIA RTX 3090 GPU. A batch size of 32 required at least 14 GB of GPU memory, while a batch size of 1 fit the models in around 6.5 GB of VRAM.
Data
The train, evaluation, and test datasets were taken directly from the MasakhaNER GitHub repository with minimal to no preprocessing. The data is of high quality, and the motivation for using it is that it is the "first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages" (source).
Intended Use
This model is intended for NLP research into areas such as interpretability or transfer learning. Using it in production is not supported due to limited generalisability and performance.
Limitations
- Training Scope: The model was only trained on one (relatively small) dataset, covering one task (NER) in one domain (news articles) over a limited span of time. Results may not generalise, and the model may perform poorly, or in a biased way, on other tasks.
- Starting Point Limitations: Since it uses xlm-roberta-base as a starting point, it may have limitations such as being biased towards the hegemonic viewpoint of most of its training data, being ungrounded, and having subpar results on other languages.
- Entity Recognition Issues: As shown by Adelani et al. (2021), the model struggles with entities longer than 3 words and those not in the training data.
- Lack of Verification: The model has not been verified in practice, and more subtle problems may arise.
Privacy & Ethical Considerations
The data comes from publicly available news sources, covering public figures and those who agreed to be reported on. No explicit ethical considerations or adjustments were made during fine-tuning.
Metrics
The main metric was the aggregate F1 score for all NER categories. These metrics are on the test set for MasakhaNER, so the data distribution is similar to the training set, and the results do not directly indicate how well the models generalise. There is large variation in transfer results when starting from different seeds, indicating that the fine-tuning process for transfer might be unstable.
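As a quick sanity check on the reported numbers, F1 is by definition the harmonic mean of precision and recall, so the aggregate F1 can be recomputed from this model's reported precision and recall (87.64 and 90.25 in the results table):

```python
# F1 is the harmonic mean of precision and recall; checking this against
# the aggregate numbers reported for this model (P = 87.64, R = 90.25).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(87.64, 90.25), 2))  # → 88.93, matching the reported aggregate F1
```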
Caveats and Recommendations
The model performed worse on the 'date' category compared to others. If dates are a critical factor, more data may need to be collected and annotated.
Model Structure
Here are some performance details for this specific model, compared to the others trained. All metrics were calculated on the test set, and the seed was chosen to give the best overall F1 score. The first three result columns are averaged over all categories, and the last four break performance down by category.
This model can predict the following labels for a token (source):
| Abbreviation | Description |
| --- | --- |
| O | Outside of a named entity |
| B-DATE | Beginning of a DATE entity right after another DATE entity |
| I-DATE | DATE entity |
| B-PER | Beginning of a person’s name right after another person’s name |
| I-PER | Person’s name |
| B-ORG | Beginning of an organisation right after another organisation |
| I-ORG | Organisation |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
| Model Name | Starting point | Evaluation / Fine-tune Language | F1 | Precision | Recall | F1 (DATE) | F1 (LOC) | F1 (ORG) | F1 (PER) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili) (This model) | [lug](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luganda) | swa | 88.93 | 87.64 | 90.25 | 83.00 | 92.00 | 79.00 | 95.00 |
| [xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili) | [hau](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-hausa) | swa | 88.36 | 86.95 | 89.82 | 86.00 | 91.00 | 77.00 | 94.00 |
| [xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili) | [ibo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-igbo) | swa | 87.75 | 86.55 | 88.97 | 85.00 | 92.00 | 77.00 | 91.00 |
| [xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili) | [kin](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-kinyarwanda) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-luo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili) | [luo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luo) | swa | 87.93 | 86.91 | 88.97 | 83.00 | 91.00 | 76.00 | 94.00 |
| [xlm-roberta-base-finetuned-naija-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-naija-finetuned-ner-swahili) | [pcm](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-naija) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili) | [swa](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-swahili) | swa | 90.36 | 88.59 | 92.20 | 86.00 | 93.00 | 79.00 | 96.00 |
| [xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili) | [wol](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-wolof) | swa | 87.80 | 86.50 | 89.14 | 86.00 | 90.00 | 78.00 | 93.00 |
| [xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili) | [yor](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-yoruba) | swa | 87.73 | 86.67 | 88.80 | 85.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-swahili) | [base](https://huggingface.co/xlm-roberta-base) | swa | 88.71 | 86.84 | 90.67 | 83.00 | 91.00 | 79.00 | 95.00 |
📄 License
This model is licensed under the Apache License, Version 2.0.






