🚀 KinyaRoBERTa - Pretrained Model for Kinyarwanda Language
A pre-trained model on the Kinyarwanda language dataset using masked language modeling (MLM). It offers capabilities for language-related tasks in Kinyarwanda.
🚀 Quick Start
The model can be used directly with the pipeline for masked language modeling, or with the transformers library to extract features.
✨ Features
- Kinyarwanda-Specific: Pretrained on a Kinyarwanda language dataset, making it suitable for Kinyarwanda language tasks.
- Uncased Tokens: Pretrained with uncased tokens, so there's no distinction between cases (e.g., ikinyarwanda and Ikinyarwanda); see the quick check below.
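As a quick, informal way to see the uncased behaviour, the sketch below tokenizes both spellings and prints the resulting token IDs; if the published tokenizer lowercases its input, both calls should produce the same IDs. (The print-and-compare approach is an illustration, not part of the original README.)
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jean-paul/kinyaRoberta-small")

# If the tokenizer is truly uncased, both spellings should map to the same token IDs.
print(tokenizer("ikinyarwanda")["input_ids"])
print(tokenizer("Ikinyarwanda")["input_ids"])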
📦 Installation
To use the model, you need the transformers library installed. You can install it with:
pip install transformers
💻 Usage Examples
Basic Usage
The model can be used directly with the pipeline for masked language modeling as follows:
from transformers import pipeline
the_mask_pipe = pipeline(
    "fill-mask",
    model='jean-paul/kinyaRoberta-small',
    tokenizer='jean-paul/kinyaRoberta-small',
)
the_mask_pipe("Ejo ndikwiga nagize <mask> baje kunsura.")
[{'sequence': 'Ejo ndikwiga nagize amahirwe baje kunsura.', 'score': 0.3530674874782562, 'token': 1711, 'token_str': ' amahirwe'},
{'sequence': 'Ejo ndikwiga nagize ubwoba baje kunsura.', 'score': 0.2858319878578186, 'token': 2594, 'token_str': ' ubwoba'},
{'sequence': 'Ejo ndikwiga nagize ngo baje kunsura.', 'score': 0.032475441694259644, 'token': 396, 'token_str': ' ngo'},
{'sequence': 'Ejo ndikwiga nagize abana baje kunsura.', 'score': 0.029481062665581703, 'token': 739, 'token_str': ' abana'},
{'sequence': 'Ejo ndikwiga nagize abantu baje kunsura.', 'score': 0.016263306140899658, 'token': 500, 'token_str': ' abantu'}]
Advanced Usage
Direct use of the transformers library to get features, using AutoModelForMaskedLM:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("jean-paul/kinyaRoberta-small")
model = AutoModelForMaskedLM.from_pretrained("jean-paul/kinyaRoberta-small")
input_text = "Ejo ndikwiga nagize abashyitsi baje kunsura."
encoded_input = tokenizer(input_text, return_tensors='pt')
output = model(**encoded_input)
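The call above returns masked-language-model logits. If you want contextual embeddings (features) instead, one minimal sketch is to load the checkpoint into a plain AutoModel encoder and pool the last hidden states; both the encoder choice and the mean pooling are illustrative assumptions, not part of the original README.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jean-paul/kinyaRoberta-small")
encoder = AutoModel.from_pretrained("jean-paul/kinyaRoberta-small")  # loads the encoder without the MLM head

encoded_input = tokenizer("Ejo ndikwiga nagize abashyitsi baje kunsura.", return_tensors='pt')
with torch.no_grad():
    features = encoder(**encoded_input).last_hidden_state  # shape: (batch, tokens, hidden_size)
sentence_embedding = features.mean(dim=1)  # simple mean pooling over tokens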
📚 Documentation
Model Description
A pretrained model on the Kinyarwanda language dataset using a masked language modeling (MLM) objective. The RoBERTa model was first introduced in this paper. This KinyaRoBERTa model was pretrained with uncased tokens, which means there is no difference between, for example, ikinyarwanda and Ikinyarwanda.
Training Parameters
Dataset
The dataset combines news articles from Rwanda extracted from different news web pages, dumped Wikipedia files, and books in Kinyarwanda. The data sources amount to 72 thousand news articles, three thousand dumped Wikipedia articles, and six books with more than a thousand pages.
Hyperparameters
The model was trained with the default configuration of RoBERTa and the Trainer from Hugging Face. However, due to compute resource constraints, we kept the number of transformer layers at 6.
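As an illustration only (not the authors' exact script), a 6-layer RoBERTa model with an otherwise default configuration could be instantiated like this; in practice the vocabulary size would come from the trained tokenizer rather than the default used here:
from transformers import RobertaConfig, RobertaForMaskedLM

# Default RoBERTa configuration except for the reduced number of transformer layers.
config = RobertaConfig(num_hidden_layers=6)  # vocab_size left at the default as a placeholder
model = RobertaForMaskedLM(config=config)
print(model.config.num_hidden_layers)  # 6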
💡 Usage Tip
We used the Hugging Face implementations for pretraining RoBERTa from scratch, i.e. both the RoBERTa model and the classes needed to do it.
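For context, here is a minimal sketch of what pretraining from scratch with those Hugging Face classes can look like. The tokenizer path, corpus file, dataset helper, and training arguments are placeholders chosen for illustration, not the values used for this model.
import torch
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("path/to/trained-tokenizer")  # placeholder path

class LineByLineDataset(torch.utils.data.Dataset):
    """Tokenizes one sentence per line from a plain-text corpus (illustrative helper)."""
    def __init__(self, path, tokenizer, max_length=128):
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        self.encodings = tokenizer(lines, truncation=True, max_length=max_length)
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {"input_ids": self.encodings["input_ids"][idx]}

train_dataset = LineByLineDataset("path/to/kinyarwanda_corpus.txt", tokenizer)  # placeholder path
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size, num_hidden_layers=6))

# The collator pads each batch and applies random masking for the MLM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kinyaRoberta-small", num_train_epochs=1),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()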