# 🚀 NusaBERT Base

NusaBERT Base is a multilingual, encoder-based language model built on the BERT architecture. It addresses the need for a single model covering Indonesian and the regional languages of Indonesia, achieving high accuracy and low loss on the relevant corpora.
## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the NusaBERT tokenizer and masked language model from the Hugging Face Hub
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
## ✨ Features

- Multilingual Support: Covers multiple languages including Indonesian, Acehnese, Balinese, and more.
- High Performance: Achieved an `eval_accuracy` of 0.6866, an `eval_loss` of 1.4876, and a perplexity of 4.4266 on a held-out subset of the corpus (a quick sanity check of the perplexity figure is sketched after this list).
- Open-Source Training: Continued pre-training on open-source corpora such as sabilmakbar/indo_wiki, [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and uonlp/CulturaX.
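For masked-language-modelling evaluations, perplexity is typically reported as the exponential of the evaluation loss, and the figures above are consistent with that relationship. A minimal sketch of the check (purely illustrative; the actual evaluation script is not part of this card):

```python
import math

eval_loss = 1.4876
perplexity = math.exp(eval_loss)  # exp(1.4876) ≈ 4.4266, matching the reported figure
print(f"perplexity ≈ {perplexity:.4f}")
```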
## 📦 Installation

This model is used with the 🤗 Transformers PyTorch framework. You can install the necessary libraries with the following command:

```bash
pip install transformers datasets tokenizers torch
```
## 💻 Usage Examples

### Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and masked language model
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Tokenize a piece of text and run a forward pass
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.logits: (batch, sequence_length, vocab_size)
```
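To actually retrieve predictions for a masked token, the `fill-mask` pipeline is the most convenient route. A minimal sketch, assuming the standard BERT `[MASK]` token and an illustrative Indonesian sentence of my own choosing:

```python
from transformers import pipeline

# Example sentence is illustrative; replace it with your own Indonesian or regional-language text
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
predictions = fill_mask("Ibu kota Indonesia adalah [MASK].")

for pred in predictions:
    print(pred["token_str"], round(pred["score"], 4))
```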
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Developed by | LazarusNLP |
| Finetuned from | IndoBERT base p1 |
| Model Type | Encoder-based BERT language model |
| Language(s) | Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum |
| License | Apache 2.0 |
| Contact | LazarusNLP |
### Training Datasets

Around 16B tokens from the following open-source corpora were used during continued pre-training: sabilmakbar/indo_wiki, [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and uonlp/CulturaX.
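If you want to inspect these corpora, they can be streamed from the Hugging Face Hub with 🤗 Datasets. A minimal sketch, assuming default configurations (the exact subsets and preprocessing used for NusaBERT are described in the paper, not reproduced here):

```python
from itertools import islice
from datasets import load_dataset

# Config and split names are illustrative; check each dataset card for the exact ones
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)

# Print a few examples without downloading the full corpus
for example in islice(indo_wiki, 3):
    print(example)
```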
### Training Hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 0.0003
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
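For readers who want to see how these values map onto the 🤗 Transformers training API, the sketch below expresses them as `TrainingArguments`. This is an illustration of the reported settings, not the authors' actual training script; the output directory is hypothetical, and the reported batch size of 256 is treated here as a per-device value.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nusabert-base-pretraining",  # hypothetical output directory
    learning_rate=3e-4,
    per_device_train_batch_size=256,         # reported value; may be a total across devices
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```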
### Framework versions
- Transformers 4.37.2
- Pytorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1
## 🔧 Technical Details

This model was trained with the 🤗 Transformers PyTorch framework on an NVIDIA H100 GPU. It is a continued pre-training of IndoBERT base p1 on the open-source corpora listed above, which is what extends its coverage to multiple languages beyond Indonesian.
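At a high level, continued pre-training of this kind re-uses an existing checkpoint and keeps training it with the masked-language-modelling objective on the new corpora. A minimal sketch of that setup (the base checkpoint name, masking probability, and data pipeline are assumptions; the authors' actual training code may differ):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Base checkpoint name assumed for illustration; NusaBERT continues from IndoBERT base p1
base_checkpoint = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Standard masked-language-modelling collator: randomly masks 15% of tokens per batch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# From here, Trainer(model=model, args=training_args, train_dataset=...,
# data_collator=data_collator).train() would run the continued pre-training loop.
```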
## 📄 License

[LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
## Credits

NusaBERT Base is developed with love by the LazarusNLP team.
## Citation

```bibtex
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```