# 🚀 NusaBERT Base

NusaBERT Base is a multilingual, encoder-based language model built on the BERT architecture. It addresses the need for a single model covering Indonesian and the regional languages of Indonesia, achieving high accuracy and low loss on the relevant corpora.
## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the NusaBERT tokenizer and masked language model from the Hugging Face Hub
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
## ✨ Features

- Multilingual Support: Covers multiple languages including Indonesian, Acehnese, Balinese, and more.
- High Performance: Achieved an `eval_accuracy` of 0.6866, an `eval_loss` of 1.4876, and a perplexity of 4.4266 on a held-out subset of the corpus (a quick sanity check of the perplexity figure is sketched after this list).
- Open-Source Training: Continued pre-training on open-source corpora such as sabilmakbar/indo_wiki, [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and uonlp/CulturaX.
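For masked-language-modelling evaluations, perplexity is typically reported as the exponential of the evaluation loss, and the figures above are consistent with that relationship. A minimal sketch of the check (purely illustrative; the actual evaluation script is not part of this card):

```python
import math

eval_loss = 1.4876
perplexity = math.exp(eval_loss)  # exp(1.4876) ≈ 4.4266, matching the reported figure
print(f"perplexity ≈ {perplexity:.4f}")
```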
## 📦 Installation

This model is used with the 🤗 Transformers PyTorch framework. You can install the necessary libraries with the following command:

```bash
pip install transformers datasets tokenizers torch
```
## 💻 Usage Examples

### Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and masked language model
model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Tokenize a piece of text and run a forward pass
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.logits: (batch, sequence_length, vocab_size)
```
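To actually retrieve predictions for a masked token, the `fill-mask` pipeline is the most convenient route. A minimal sketch, assuming the standard BERT `[MASK]` token and an illustrative Indonesian sentence of my own choosing:

```python
from transformers import pipeline

# Example sentence is illustrative; replace it with your own Indonesian or regional-language text
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
predictions = fill_mask("Ibu kota Indonesia adalah [MASK].")

for pred in predictions:
    print(pred["token_str"], round(pred["score"], 4))
```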
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Developed by | LazarusNLP |
| Finetuned from | IndoBERT base p1 |
| Model Type | Encoder-based BERT language model |
| Language(s) | Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum |
| License | Apache 2.0 |
| Contact | LazarusNLP |
### Training Datasets

Around 16B tokens from the following open-source corpora were used during continued pre-training: sabilmakbar/indo_wiki, [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and uonlp/CulturaX.
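If you want to inspect these corpora, they can be streamed from the Hugging Face Hub with 🤗 Datasets. A minimal sketch, assuming default configurations (the exact subsets and preprocessing used for NusaBERT are described in the paper, not reproduced here):

```python
from itertools import islice
from datasets import load_dataset

# Config and split names are illustrative; check each dataset card for the exact ones
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)

# Print a few examples without downloading the full corpus
for example in islice(indo_wiki, 3):
    print(example)
```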
### Training Hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 0.0003
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
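For readers who want to see how these values map onto the 🤗 Transformers training API, the sketch below expresses them as `TrainingArguments`. This is an illustration of the reported settings, not the authors' actual training script; the output directory is hypothetical, and the reported batch size of 256 is treated here as a per-device value.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nusabert-base-pretraining",  # hypothetical output directory
    learning_rate=3e-4,
    per_device_train_batch_size=256,         # reported value; may be a total across devices
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```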
### Framework versions
- Transformers 4.37.2
- Pytorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1
## 🔧 Technical Details

This model was trained with the 🤗 Transformers PyTorch framework on an NVIDIA H100 GPU. It is a continued pre-training of IndoBERT base p1 on the open-source corpora listed above, which is what extends its coverage to multiple languages beyond Indonesian.
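At a high level, continued pre-training of this kind re-uses an existing checkpoint and keeps training it with the masked-language-modelling objective on the new corpora. A minimal sketch of that setup (the base checkpoint name, masking probability, and data pipeline are assumptions; the authors' actual training code may differ):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Base checkpoint name assumed for illustration; NusaBERT continues from IndoBERT base p1
base_checkpoint = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Standard masked-language-modelling collator: randomly masks 15% of tokens per batch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# From here, Trainer(model=model, args=training_args, train_dataset=...,
# data_collator=data_collator).train() would run the continued pre-training loop.
```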
## 📄 License

[LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.
## Credits

NusaBERT Base is developed with love by the LazarusNLP team.
## Citation

```bibtex
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```