🚀 xlm-roberta-base-finetuned-ner-naija
This is a token classification (specifically NER) model fine-tuned on the Nigerian Pidgin part of the MasakhaNER dataset, released to support NLP research.
🚀 Quick Start
To use this model (or others), you can do the following, just changing the model name (source):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = 'mbeukman/xlm-roberta-base-finetuned-ner-naija'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Mixed Martial Arts joinbodi , Ultimate Fighting Championship , UFC don decide say dem go enta back di octagon on Saturday , 9 May , for Jacksonville , Florida ."

ner_results = nlp(example)
print(ner_results)
```
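With a recent version of `transformers`, each entry in `ner_results` is a dictionary containing the predicted tag for a token (e.g. `B-ORG` or `I-LOC`), a confidence score, and the character offsets of the matched text; the full tag set is described under Model Structure below.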
✨ Features
- Transformer-based: This model is transformer-based and was fine-tuned on the MasakhaNER dataset.
- Multi-metric evaluation: The main metric is the aggregate F1 score across all NER categories; precision and recall are also reported.
📦 Installation
Only the Hugging Face `transformers` library (with PyTorch as a backend) is needed to run the snippets below: `pip install transformers torch`.
📚 Documentation
About
This is a token classification (specifically NER) model obtained by fine-tuning xlm-roberta-base on the MasakhaNER dataset, specifically the Nigerian Pidgin part.
The model was fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32 and a learning rate of 5e-5. This process was repeated 5 times (with different random seeds), and the uploaded model is the one that performed best of those 5 runs (by aggregate F1 on the test set).
This model was fine-tuned by Michael Beukman while doing a project at the University of the Witwatersrand, Johannesburg. This is version 1, as of 20 November 2021.
Contact & More information
For more information about the models, including training scripts, detailed results and further resources, you can visit the main GitHub repository. You can contact the author by filing an issue on that repository.
Training Resources
Fine-tuning each model on the NER dataset took between 10 and 30 minutes and was performed on an NVIDIA RTX 3090 GPU. To use a batch size of 32, at least 14 GB of GPU memory was required, although it was just possible to fit these models in around 6.5 GB of VRAM when using a batch size of 1.
Data
The train, evaluation and test datasets were taken directly from the MasakhaNER Github repository, with minimal to no preprocessing, as the original dataset is already of high quality.
The motivation for the use of this data is that it is the "first large, publicly available, high quality dataset for named entity recognition (NER) in ten African languages" (source).
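If you want to inspect the data, one option is the Hugging Face `datasets` library. This is a minimal sketch assuming the MasakhaNER data is mirrored on the Hub as `masakhaner` with a `pcm` (Nigerian Pidgin) configuration; the model itself was trained on the files from the MasakhaNER GitHub repository:

```python
from datasets import load_dataset

# Assumption: the MasakhaNER data is available on the Hugging Face Hub
# as "masakhaner" with a "pcm" (Nigerian Pidgin) configuration.
masakhaner_pcm = load_dataset("masakhaner", "pcm")

print(masakhaner_pcm)                           # train / validation / test splits
print(masakhaner_pcm["train"][0]["tokens"])     # whitespace-tokenised words
print(masakhaner_pcm["train"][0]["ner_tags"])   # integer-encoded NER labels
```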
Intended Use
This model is intended to be used for NLP research into, for example, interpretability or transfer learning. Using this model in production is not supported, as both generalisability and overall performance are limited.
Limitations
This model was only trained on one (relatively small) dataset, covering one task (NER) in one domain (news articles) and in a set span of time. The results may not generalise, and the model may perform badly, or in an unfair / biased way if used on other tasks.
This model's limitations can also include being biased towards the hegemonic viewpoint of most of its training data, being ungrounded and having subpar results on other languages (possibly due to unbalanced training data).
Privacy & Ethical Considerations
The data comes only from publicly available news sources, so it should cover only public figures and those who agreed to be reported on.
No explicit ethical considerations or adjustments were made during fine-tuning of this model.
Metrics
The language-adaptive models achieve (mostly) superior performance compared to starting from xlm-roberta-base. The main metric was the aggregate F1 score across all NER categories.
These metrics were computed on the MasakhaNER test set, so the data distribution is similar to that of the training set; as such, the results do not directly indicate how well these models generalise to other distributions.
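The aggregate F1 is an entity-level score of the kind computed by the `seqeval` library. The following is an illustrative sketch with made-up tag sequences, not the actual evaluation code used for this model:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy example: gold and predicted tag sequences for two sentences.
# These values are illustrative only, not outputs of this model.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "B-DATE"]]

print("F1:       ", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
```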
Caveats and Recommendations
In general, this model performed worse on the 'date' category than on the others, so if dates are a critical factor, that might need to be taken into account and addressed, for example by collecting and annotating more data.
Model Structure
This model can predict the following label for a token (source):
| Abbreviation | Description |
| --- | --- |
| O | Outside of a named entity |
| B-DATE | Beginning of a DATE entity right after another DATE entity |
| I-DATE | DATE entity |
| B-PER | Beginning of a person’s name right after another person’s name |
| I-PER | Person’s name |
| B-ORG | Beginning of an organisation right after another organisation |
| I-ORG | Organisation |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
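Because the raw pipeline output is token-level B-/I- tags, it is often convenient to merge them into whole entity spans. A minimal sketch, assuming a reasonably recent `transformers` version that supports the `aggregation_strategy` argument:

```python
from transformers import pipeline

# Assumes a recent transformers version with `aggregation_strategy` support.
nlp = pipeline(
    "ner",
    model="mbeukman/xlm-roberta-base-finetuned-ner-naija",
    aggregation_strategy="simple",  # merge B-/I- tags into whole entities
)

example = "Mixed Martial Arts joinbodi , Ultimate Fighting Championship , UFC don decide say dem go enta back di octagon on Saturday , 9 May , for Jacksonville , Florida ."
for entity in nlp(example):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```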
The table below compares this model with related fine-tuned variants, evaluated on the same Nigerian Pidgin (pcm) test set:

| Model Name | Starting point | Evaluation / Fine-tune Language | F1 | Precision | Recall | F1 (DATE) | F1 (LOC) | F1 (ORG) | F1 (PER) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| xlm-roberta-base-finetuned-ner-naija (This model) | base | pcm | 88.89 | 88.13 | 89.66 | 92.00 | 87.00 | 82.00 | 94.00 |
| xlm-roberta-base-finetuned-naija-finetuned-ner-naija | pcm | pcm | 88.06 | 87.04 | 89.12 | 90.00 | 88.00 | 81.00 | 92.00 |
| xlm-roberta-base-finetuned-swahili-finetuned-ner-naija | swa | pcm | 89.12 | 87.84 | 90.42 | 90.00 | 89.00 | 82.00 | 94.00 |
🔧 Technical Details
The model was fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32 and a learning rate of 5e-5. This process was repeated 5 times (with different random seeds), and the uploaded model is the one that performed best of those 5 runs (by aggregate F1 on the test set).
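The training scripts themselves live in the GitHub repository linked above. As a rough, hypothetical sketch of how the reported hyperparameters map onto a Hugging Face `Trainer` configuration (dataset preparation and label alignment are omitted):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

# Hyperparameters as reported above; names and paths below are illustrative only.
BASE_MODEL = "xlm-roberta-base"
NUM_LABELS = 9    # O plus B-/I- tags for DATE, PER, ORG and LOC
MAX_LENGTH = 200  # applied at tokenisation time, e.g.
                  # tokenizer(words, is_split_into_words=True,
                  #           truncation=True, max_length=MAX_LENGTH)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForTokenClassification.from_pretrained(BASE_MODEL, num_labels=NUM_LABELS)

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-ner-naija",  # hypothetical output path
    num_train_epochs=50,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    seed=1,  # training was repeated with 5 different random seeds
)
```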
📄 License
This model is licensed under the Apache License, Version 2.0.






