🚀 xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili
This is a token classification model (specifically, named entity recognition) obtained by fine-tuning xlm-roberta-base-finetuned-luganda on the Swahili portion of the MasakhaNER dataset. It provides a ready-made solution for NER on Swahili text.
🚀 Quick Start
To use this model (or any of the related models listed below), change only the model name:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
ner_results = nlp(example)
print(ner_results)
```
✨ Features
- Transformer-based: This model is built on a transformer architecture and fine-tuned on the MasakhaNER dataset, which contains news articles in 10 different African languages.
- Fine-tuning Details: It was fine-tuned for 50 epochs, with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. The process was repeated 5 times with different random seeds, and the uploaded model had the best aggregate F1 on the test set.
- License: The model is licensed under the Apache License, Version 2.0.
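The fine-tuning settings listed above can be expressed as a `transformers` `TrainingArguments` sketch. This is an illustrative reconstruction, not the original training script; the maximum sequence length of 200 is applied at tokenization time rather than here, and `output_dir` is a placeholder.

```python
# Hypothetical reconstruction of the reported hyperparameters using
# transformers.TrainingArguments; the original training script may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ner-swahili",          # placeholder output directory
    num_train_epochs=50,               # 50 epochs, as reported
    per_device_train_batch_size=32,    # batch size of 32
    learning_rate=5e-5,                # learning rate of 5e-5
    seed=1,                            # the original work repeated this over 5 seeds
)
```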
📦 Installation
The original document lists no model-specific installation steps; the model only requires the Hugging Face `transformers` library and a backend such as PyTorch.
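A minimal environment can be set up with pip. This assumes a standard PyTorch backend, which is not stated explicitly in the original card:

```shell
# Assumed standard setup: the transformers library plus a PyTorch backend.
pip install transformers torch
```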
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
ner_results = nlp(example)
print(ner_results)
```
Advanced Usage
The original document provides no advanced examples, but note that the token-level output can be post-processed into whole entities, e.g. via the `aggregation_strategy` parameter of the `transformers` NER pipeline.
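As an illustrative sketch (not from the original card), token-level IOB2 predictions can also be merged into whole entity spans manually. The sample dicts below are hypothetical but mimic the format returned by the `transformers` "ner" pipeline, where "▁" is the SentencePiece word-start marker:

```python
# Illustrative sketch: merging token-level IOB2 predictions into entity
# spans. The input dicts mimic the format returned by the transformers
# "ner" pipeline; the sample values below are hypothetical.

def merge_entities(token_results):
    """Group B-/I- tagged sub-word pieces into (entity_type, text) spans."""
    entities = []
    for tok in token_results:
        tag = tok["entity"]                         # e.g. "B-LOC" or "I-LOC"
        piece = tok["word"]
        sep = " " if piece.startswith("▁") else ""  # "▁" marks a new word
        word = piece.lstrip("▁")
        if tag.startswith("B-") or not entities:
            entities.append([tag[2:], word])        # start a new entity
        else:
            entities[-1][1] += sep + word           # continue the last entity
    return [tuple(e) for e in entities]

sample = [
    {"entity": "B-LOC", "word": "▁Tan"},
    {"entity": "I-LOC", "word": "zania"},
    {"entity": "B-DATE", "word": "▁Jumatatu"},
]
print(merge_entities(sample))  # → [('LOC', 'Tanzania'), ('DATE', 'Jumatatu')]
```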
📚 Documentation
About
This model is transformer-based and fine-tuned on MasakhaNER, a named entity recognition dataset consisting mostly of news articles in 10 different African languages. The model was fine-tuned by Michael Beukman during a project at the University of the Witwatersrand, Johannesburg. This is version 1, as of 20 November 2021.
Contact & More information
For more information about the models, including training scripts, detailed results, and further resources, you can visit the main GitHub repository. You can contact the author by filing an issue on that repository.
Training Resources
Fine-tuning each model on the NER dataset took between 10 and 30 minutes and was performed on an NVIDIA RTX 3090 GPU. A batch size of 32 required at least 14 GB of GPU memory, while a batch size of 1 fit the models in around 6.5 GB of VRAM.
Data
The train, evaluation, and test datasets were taken directly from the MasakhaNER GitHub repository with minimal to no preprocessing. The data is of high quality, and the motivation for using it is that it is the "first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages" (source).
Intended Use
This model is intended for NLP research into areas such as interpretability or transfer learning. Using it in production is not supported due to limited generalisability and performance.
Limitations
- Training Scope: The model was only trained on one (relatively small) dataset, covering one task (NER) in one domain (news articles) over a limited span of time. Results may not generalise, and the model may perform poorly, or in a biased way, on other tasks.
- Starting Point Limitations: Since it uses xlm-roberta-base as a starting point, it may have limitations such as being biased towards the hegemonic viewpoint of most of its training data, being ungrounded, and having subpar results on other languages.
- Entity Recognition Issues: As shown by Adelani et al. (2021), the model struggles with entities longer than 3 words and those not in the training data.
- Lack of Verification: The model has not been verified in practice, and more subtle problems may arise.
Privacy & Ethical Considerations
The data comes from publicly available news sources, covering public figures and those who agreed to be reported on. No explicit ethical considerations or adjustments were made during fine-tuning.
Metrics
The main metric was the aggregate F1 score for all NER categories. These metrics are on the test set for MasakhaNER, so the data distribution is similar to the training set, and the results do not directly indicate how well the models generalise. There is large variation in transfer results when starting from different seeds, indicating that the fine-tuning process for transfer might be unstable.
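As a quick sanity check on the reported numbers, F1 is by definition the harmonic mean of precision and recall, so the aggregate F1 can be recomputed from this model's reported precision and recall (87.64 and 90.25 in the results table):

```python
# F1 is the harmonic mean of precision and recall; checking this against
# the aggregate numbers reported for this model (P = 87.64, R = 90.25).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(87.64, 90.25), 2))  # → 88.93, matching the reported aggregate F1
```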
Caveats and Recommendations
The model performed worse on the 'date' category compared to others. If dates are a critical factor, more data may need to be collected and annotated.
Model Structure
Here are some performance details for this specific model, compared to the others trained. All metrics were calculated on the test set, and the seed was chosen to give the best overall F1 score. The first three result columns are averaged over all categories, and the last four break performance down by category.
This model can predict the following labels for a token (source):
| Abbreviation | Description |
| --- | --- |
| O | Outside of a named entity |
| B-DATE | Beginning of a DATE entity right after another DATE entity |
| I-DATE | DATE entity |
| B-PER | Beginning of a person’s name right after another person’s name |
| I-PER | Person’s name |
| B-ORG | Beginning of an organisation right after another organisation |
| I-ORG | Organisation |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
| Model Name | Starting point | Evaluation / Fine-tune Language | F1 | Precision | Recall | F1 (DATE) | F1 (LOC) | F1 (ORG) | F1 (PER) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili) (This model) | [lug](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luganda) | swa | 88.93 | 87.64 | 90.25 | 83.00 | 92.00 | 79.00 | 95.00 |
| [xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili) | [hau](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-hausa) | swa | 88.36 | 86.95 | 89.82 | 86.00 | 91.00 | 77.00 | 94.00 |
| [xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili) | [ibo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-igbo) | swa | 87.75 | 86.55 | 88.97 | 85.00 | 92.00 | 77.00 | 91.00 |
| [xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili) | [kin](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-kinyarwanda) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-luo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili) | [luo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luo) | swa | 87.93 | 86.91 | 88.97 | 83.00 | 91.00 | 76.00 | 94.00 |
| [xlm-roberta-base-finetuned-naija-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-naija-finetuned-ner-swahili) | [pcm](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-naija) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili) | [swa](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-swahili) | swa | 90.36 | 88.59 | 92.20 | 86.00 | 93.00 | 79.00 | 96.00 |
| [xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili) | [wol](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-wolof) | swa | 87.80 | 86.50 | 89.14 | 86.00 | 90.00 | 78.00 | 93.00 |
| [xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili) | [yor](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-yoruba) | swa | 87.73 | 86.67 | 88.80 | 85.00 | 91.00 | 75.00 | 93.00 |
| [xlm-roberta-base-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-swahili) | [base](https://huggingface.co/xlm-roberta-base) | swa | 88.71 | 86.84 | 90.67 | 83.00 | 91.00 | 79.00 | 95.00 |
📄 License
This model is licensed under the Apache License, Version 2.0.






