# Indonesian RoBERTa Base
Indonesian RoBERTa Base is a masked language model based on the RoBERTa architecture, trained from scratch on Indonesian text from the OSCAR dataset.
## 🚀 Quick Start
Indonesian RoBERTa Base was trained from scratch on the OSCAR dataset, specifically the `unshuffled_deduplicated_id` subset. It achieved an evaluation loss of 1.798 and an evaluation accuracy of 62.45%.
This model was trained using HuggingFace's Flax framework as part of the JAX/Flax Community Week organized by HuggingFace. All training was done on a TPUv3-8 VM sponsored by the Google Cloud team.

All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.
## ✨ Features
- Based on the RoBERTa architecture, a powerful masked language model.
- Trained from scratch on the OSCAR dataset for Indonesian text.
- Achieved an evaluation loss of 1.798 and an evaluation accuracy of 62.45%.
## 📦 Installation
No model-specific installation is required; the model is loaded through the Hugging Face `transformers` library, as shown in the usage examples below.
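A minimal setup for running the examples (an assumption, not part of the original card; the PyTorch example additionally needs `torch`):

```bash
pip install transformers torch
```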
## 💻 Usage Examples
### Basic Usage
#### As Masked Language Model
```python
from transformers import pipeline

pretrained_name = "flax-community/indonesian-roberta-base"

# Load the fill-mask pipeline with the pretrained model and tokenizer
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Predict the masked token ("Budi is <mask> at school.")
fill_mask("Budi sedang <mask> di sekolah.")
```
#### Feature Extraction in PyTorch
```python
from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "flax-community/indonesian-roberta-base"

model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

# "Budi is at school."
prompt = "Budi sedang berada di sekolah."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
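The `output` above holds the contextual token embeddings in `last_hidden_state`. A common follow-up, sketched below as an assumption rather than part of the original card, is to mean-pool those embeddings into a single sentence vector:

```python
import torch

# Mean-pool the final hidden states into one sentence embedding.
# last_hidden_state has shape (batch_size, seq_len, hidden_size).
with torch.no_grad():
    output = model(**encoded_input)

sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for a base-sized model
```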
## 📚 Documentation
### Model
| Property | Details |
|----------|---------|
| Model Type | indonesian-roberta-base |
| #params | 124M |
| Architecture | RoBERTa |
| Training/Validation data (text) | OSCAR `unshuffled_deduplicated_id` Dataset |
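The parameter count can be checked locally; a minimal sketch (not part of the original card), assuming the PyTorch weights load as in the usage example above:

```python
from transformers import RobertaModel

model = RobertaModel.from_pretrained("flax-community/indonesian-roberta-base")
print(f"{model.num_parameters():,} parameters")  # roughly 124M
```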
### Evaluation Results
The model was trained for 8 epochs, and the following are the final results once training ended.
| train loss | valid loss | valid accuracy | total time |
|------------|------------|----------------|------------|
| 1.870 | 1.798 | 0.6245 | 18:25:39 |
## 🔧 Technical Details
The model was trained using HuggingFace's Flax framework on a TPUv3-8 VM sponsored by the Google Cloud team. All training scripts can be found in the Files and versions tab, and training metrics are logged via TensorBoard.
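Because the weights were trained with Flax, they can also be loaded directly through the Flax model classes in `transformers`. A minimal sketch, assuming `jax` and `flax` are installed:

```python
from transformers import FlaxRobertaModel, RobertaTokenizerFast

pretrained_name = "flax-community/indonesian-roberta-base"

tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)
model = FlaxRobertaModel.from_pretrained(pretrained_name)

# Flax models expect NumPy arrays rather than PyTorch tensors.
inputs = tokenizer("Budi sedang berada di sekolah.", return_tensors="np")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```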
## 📄 License
This model is released under the MIT license.
## 👥 Team Members