🚀 mT5_base_yoruba_adr
mT5_base_yoruba_adr is an automatic diacritics restoration (ADR) model for the Yorùbá language. It is a fine-tuned mT5-base model that achieves state-of-the-art performance in restoring the correct diacritics and tonal marks to Yorùbá text, for example turning the undiacritized input "awon omo" into "àwọn ọmọ".
🚀 Quick Start
You can use this model with the Transformers text2text-generation pipeline for ADR:
```python
from transformers import pipeline

# "Davlan/mT5_base_yoruba_adr" is the assumed Hub ID; replace it with the
# actual model path if it differs.
adr = pipeline("text2text-generation", model="Davlan/mT5_base_yoruba_adr")

example = "awon omo wa ni ile-iwe"  # undiacritized Yorùbá input (illustrative)
print(adr(example))
```
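The pipeline returns a list of dictionaries, one per input, each with a `generated_text` field holding the diacritized sentence.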
✨ Features
- State-of-the-art Performance: achieves state-of-the-art results in adding correct diacritics and tonal marks to Yorùbá text.
- Fine-tuned Model: based on mT5-base and fine-tuned on Yorùbá corpora (JW300 and MENYO-20k).
📦 Installation
The model runs with the Hugging Face Transformers library; the mT5 tokenizer typically also needs sentencepiece: `pip install transformers sentencepiece`.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_id = "Davlan/mT5_base_yoruba_adr"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

adr = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
example = "awon omo wa ni ile-iwe"  # undiacritized Yorùbá input (illustrative)
results = adr(example)
print(results)
```
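Advanced Usage
For finer control over decoding, you can call `generate` directly instead of using the pipeline. A minimal sketch, assuming the same Hub ID as above; the decoding parameters are illustrative:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Davlan/mT5_base_yoruba_adr"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Tokenize the undiacritized sentence and decode with beam search.
inputs = tokenizer("awon omo wa ni ile-iwe", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```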
📚 Documentation
Intended uses & limitations
How to use
The model can be used with the Transformers text2text-generation pipeline for automatic diacritics restoration (ADR), as shown in the usage examples above.
Limitations and bias
This model is limited by its training data, the JW300 Yorùbá corpus (largely religious-domain text) and the MENYO-20k dataset, and may not generalize well to text from other domains.
Training data
This model was fine-tuned on the JW300 Yorùbá corpus and the MENYO-20k dataset.
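The source side of ADR training pairs is typically created by stripping the diacritics from the clean corpus sentences. A minimal sketch of that preprocessing (an assumption about the setup, not the authors' exact script):
```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove tone marks and underdots by dropping Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Each training pair maps the stripped sentence (source) to the original
# fully diacritized sentence (target).
target = "àwọn ọmọ wà ní ilé-ìwé"
source = strip_diacritics(target)  # -> "awon omo wa ni ile-iwe"
```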
Training procedure
This model was trained on a single NVIDIA V100 GPU.
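For reference, a condensed sketch of how such a seq2seq fine-tuning run can be set up with the Transformers `Seq2SeqTrainer`; the hyperparameters and dataset wiring are illustrative assumptions, not the authors' reported configuration:
```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

# Toy stand-in for the JW300/MENYO-20k (source, target) sentence pairs.
raw = Dataset.from_dict({
    "source": ["awon omo wa ni ile-iwe"],
    "target": ["àwọn ọmọ wà ní ilé-ìwé"],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = raw.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="mt5_base_yoruba_adr",
    per_device_train_batch_size=8,   # sized for a single V100
    learning_rate=3e-4,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```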
Eval results on Test set (BLEU score)
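Evaluation uses BLEU between the model's output and the gold diacritized test sentences. A minimal scoring sketch with sacrebleu (an illustrative assumption, not the authors' evaluation script):
```python
import sacrebleu

# Model outputs and the aligned gold references (toy examples).
hypotheses = ["àwọn ọmọ wà ní ilé-ìwé"]
references = [["àwọn ọmọ wà ní ilé-ìwé"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```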
BibTeX entry and citation info
By Jesujoba Alabi and David Adelani
🔧 Technical Details
The model is a fine-tuned mT5-base model. It was trained on a single NVIDIA V100 GPU and fine-tuned on Yorùbá corpora (JW300 and MENYO-20k), which accounts for its performance on automatic diacritics restoration for the Yorùbá language.