TswanaBert: An Open-Source Setswana Language Model for Setswana Content Processing and Understanding

Tswanabert

Developed by MoseliMotsoehli

A Tswana language model pretrained with the Masked Language Modeling (MLM) objective.

Large Language Model | Other | #Tswana Pretraining #Masked Language Modeling #Low-Resource Language Processing

Downloads 42

Release Time: 3/2/2022

Model Overview

Tswana BERT is a Transformer model pretrained in a self-supervised manner on a Tswana corpus. During pretraining, a portion of the input words is masked, and the model, which operates on byte-level tokens, learns to predict the masked content.

Model Features

Tswana-Specific

A BERT model specifically optimized and trained for the Tswana language.

Self-Supervised Learning

Pretrained using the Masked Language Modeling task.

Byte-Level Tokenization

Processes input text using byte-level tokenization.
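
As a rough illustration of what byte-level tokenization means, the sketch below reproduces the byte-to-character mapping used by GPT-2/RoBERTa-style byte-level BPE tokenizers. That TswanaBert follows this exact convention is an assumption; it is suggested by the `Ġ` prefix on tokens such as `Ġsetse` in the usage example, but the source does not state it. The helper names `bytes_to_unicode` and `to_byte_level` are illustrative only, and the sketch covers only the pre-tokenization mapping, not the learned BPE merges.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character.

    Assumption: this mirrors the GPT-2/RoBERTa byte-level BPE convention.
    """
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Encode text as UTF-8 and render every byte as its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" setse"))  # Ġsetse
```

Under this mapping a leading space becomes `Ġ`, so a token string beginning with `Ġ` marks the start of a new word, and any character outside ASCII is represented by several byte-level symbols rather than by an unknown token.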

Model Capabilities

Masked Word Prediction

Tswana Text Understanding

Downstream Task Fine-Tuning

Use Cases

Natural Language Processing

Text Completion

Predicts masked Tswana words.

The usage example shows the model completing an everyday Tswana sentence.

Language Model Fine-Tuning

Can serve as a base model for downstream NLP tasks.

🚀 TswanaBert

A model pre-trained on Tswana text with a masked language modeling (MLM) objective, intended as a base for Tswana language tasks.

🚀 Quick Start

TswanaBERT is a transformer model pre-trained on a Setswana corpus in a self-supervised fashion. During training, part of the input words is masked and the model learns to predict the masked words, operating on byte-level tokens.
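
The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the author's training code: the function name `mask_tokens`, the default 15% masking rate (the usual BERT convention), and word-level splitting are all assumptions, and the real model masks byte-level subword tokens rather than whitespace-separated words.

```python
import random

MASK_TOKEN = "<mask>"  # the mask string that appears in the usage example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels holds the original token at each
    masked position and None elsewhere, so a training loss would be computed
    only on the masked positions.
    Assumption: mask_prob=0.15 follows the common BERT default.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "Ntshopotse setse e godile .".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During pretraining the model sees the masked sequence and is scored on how well it recovers the entries of `labels` that are not `None`.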

✨ Features

Intended Uses

The model can be used for either masked language modeling or next-word prediction.
It can also be fine-tuned on a specific downstream NLP application.

Limitations

The model is trained on a relatively small collection of Setswana text, mostly news articles and creative writing, and so is not yet representative of the language as a whole.

💻 Usage Examples

Basic Usage

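>>> from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

>>> # Load the tokenizer and masked-LM weights from the Hugging Face Hub.
>>> # AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead.
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> # Build a fill-mask pipeline and ask for candidates for the <mask> slot.
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Ntshopotse <mask> e godile.")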

[{'score': 0.32749542593955994,
  'sequence': '<s>Ntshopotse setse e godile.</s>',
  'token': 538,
  'token_str': 'Ġsetse'},
 {'score': 0.060260992497205734,
  'sequence': '<s>Ntshopotse le e godile.</s>',
  'token': 270,
  'token_str': 'Ġle'},
 {'score': 0.058460816740989685,
  'sequence': '<s>Ntshopotse bone e godile.</s>',
  'token': 364,
  'token_str': 'Ġbone'},
 {'score': 0.05694682151079178,
  'sequence': '<s>Ntshopotse ga e godile.</s>',
  'token': 298,
  'token_str': 'Ġga'},
 {'score': 0.0565204992890358,
  'sequence': '<s>Ntshopotse, e godile.</s>',
  'token': 16,
  'token_str': ','}]
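
The returned list can be post-processed into readable candidates. The sketch below is one possible way to do that, using the predictions shown above as literal data so that it runs without downloading the model; the helper names `clean_token` and `top_predictions` are hypothetical, and treating `Ġ` as a leading-space marker assumes the GPT-2/RoBERTa byte-level convention.

```python
# Predictions copied from the example output above (sequence field omitted).
predictions = [
    {"score": 0.32749542593955994, "token": 538, "token_str": "Ġsetse"},
    {"score": 0.060260992497205734, "token": 270, "token_str": "Ġle"},
    {"score": 0.058460816740989685, "token": 364, "token_str": "Ġbone"},
    {"score": 0.05694682151079178, "token": 298, "token_str": "Ġga"},
    {"score": 0.0565204992890358, "token": 16, "token_str": ","},
]


def clean_token(token_str):
    """Turn a byte-level token string into plain text.

    Assumption: 'Ġ' marks a leading space, as in GPT-2/RoBERTa tokenizers.
    """
    return token_str.replace("Ġ", " ").strip()


def top_predictions(preds, min_score=0.0):
    """Return (word, rounded score) pairs, best first, above min_score."""
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    return [
        (clean_token(p["token_str"]), round(p["score"], 3))
        for p in ranked
        if p["score"] >= min_score
    ]


print(top_predictions(predictions)[0])  # ('setse', 0.327)
```

A threshold such as `min_score=0.06` keeps only the two strongest candidates here, which can be useful because the lower-ranked scores are close to one another.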

📚 Documentation

Training Data

The largest portion of this dataset (10k sentences) comes from the Leipzig Corpora Collection.
We added SABC news headlines collected by Vukosi Marivate and Tshephisho Sefara (2020), which are generously made available on Zenodo. This added 185 Tswana sentences to the corpus.
We also added 300 more sentences by scraping the following news sites and blogs, most of which originate in Botswana. We continue to expand the dataset.
- http://setswana.blogspot.com/
- https://omniglot.com/writing/tswana.php
- http://www.dailynews.gov.bw/
- http://www.mmegi.bw/index.php
- https://tsena.co.bw
- http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html
- https://www.poemhunter.com/poem/2013-setswana/
- https://www.poemhunter.com/poem/ngwana-wa-mosetsana/

📄 License

BibTeX entry and citation info

@inproceedings{author = {Moseli Motsoehli},
  year={2020}
}
