Indo-roberta-small Open-source Model - Free for Indonesian Text Filling and Feature Extraction

Indo Roberta Small

Developed by w11wo

Indonesian Small RoBERTa is a masked language model based on the RoBERTa architecture. It was trained specifically for the Indonesian language and is suited to text infilling and feature extraction.

Category: Large Language Model (Other)
License: MIT (open source)
Tags: Indonesian Masked Prediction, Wikipedia Training, Small RoBERTa Architecture

Downloads 50

Release Time : 3/2/2022

Model Overview

This model is an Indonesian masked language model based on the RoBERTa architecture, trained on Indonesian Wikipedia data, primarily used for text infilling and feature extraction.

Model Features

Indonesian Language Optimization

Trained specifically for the Indonesian language, so it is suited to Indonesian text tasks.

Lightweight Model

Only 84M parameters, suitable for deployment in resource-limited environments.

Based on RoBERTa Architecture

Uses the RoBERTa architecture, a BERT variant pretrained with a masked language modeling objective.

Model Capabilities

Text Infilling

Feature Extraction

Indonesian Text Processing

Use Cases

Text Processing

Text Infilling

Fill in a masked token in a sentence, such as 'Budi sedang <mask> di sekolah.'

Feature Extraction

Extract semantic features from text for downstream tasks

🚀 Indo RoBERTa Small

Indo RoBERTa Small is a masked language model for the Indonesian language. It uses the RoBERTa architecture and was trained on Indonesian Wikipedia data, and it is intended for masked-token prediction and feature extraction.

🚀 Quick Start

Indo RoBERTa Small is a masked language model based on the RoBERTa model. It was trained on Indonesian Wikipedia articles from late December 2020, the latest available at the time of training.

The model was trained from scratch and achieved a perplexity of 48.27 on the validation dataset (20% of the articles). Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger, in which he fine-tuned DistilGPT-2 on WikiText-2.

The model was trained with Hugging Face's Transformers library, using the base RoBERTa model and the library's Trainer class. PyTorch was the backend framework during training, but the model remains compatible with TensorFlow.
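As a rough illustration of the 20% validation holdout mentioned above, the sketch below shuffles a list of articles and splits off a validation portion. It shows the general idea only: the function name `holdout_split`, the seed, and the shuffling strategy are hypothetical and are not taken from the original training notebook.

```python
import random


def holdout_split(articles, valid_fraction=0.2, seed=42):
    """Shuffle articles and return (train, valid) lists.

    Hypothetical helper for illustration; the original split
    procedure is not documented here.
    """
    shuffled = list(articles)              # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder article strings, not real data
articles = [f"artikel {i}" for i in range(10)]
train, valid = holdout_split(articles)
print(len(train), len(valid))
```

Fixing the seed keeps the split reproducible, so validation numbers from different runs stay comparable.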

✨ Features

Based on RoBERTa: Built on the RoBERTa architecture for language understanding.
Trained on Indonesian Wikipedia: Uses a large-scale Indonesian dataset for better performance on Indonesian text.
Cross-framework Compatibility: Compatible with both PyTorch and TensorFlow.

📚 Documentation

Model

Property	Details
Model Type	`indo-roberta-small`
Parameters	84M
Architecture	RoBERTa
Training Data	Indonesian Wikipedia (3.1 GB of text)
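The parameter count in the table can be checked locally. The sketch below is one way to do so, assuming PyTorch and Transformers are installed. The helper names `count_parameters` and `report_size` are hypothetical. The total may also differ slightly from 84M depending on whether the language-modeling head is loaded.

```python
def count_parameters(model):
    """Sum the element counts of all parameters of a PyTorch-style model."""
    return sum(p.numel() for p in model.parameters())


def report_size(pretrained_name="w11wo/indo-roberta-small"):
    """Download the model and print its size in millions of parameters.

    Not called automatically, because it needs network access.
    """
    from transformers import AutoModelForMaskedLM

    model = AutoModelForMaskedLM.from_pretrained(pretrained_name)
    print(f"{count_parameters(model) / 1e6:.1f}M parameters")
```

Calling `report_size()` downloads the weights on first use and prints the total.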

Evaluation Results

The model was trained for 3 epochs. The final results at the end of training are shown below.

train loss	valid loss	perplexity	total time
4.071	3.876	48.27	3:40:55
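The perplexity and validation loss in the table are related. Assuming the reported perplexity is the exponential of the mean cross-entropy validation loss in nats (the usual convention in Hugging Face language-modeling examples), the relationship can be sketched as follows. The function name `perplexity` is chosen here for illustration.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Validation loss from the table above
print(round(perplexity(3.876), 2))
```

This gives a value of about 48.2. The small gap to the reported 48.27 is consistent with the loss having been rounded to three decimals.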

💻 Usage Examples

Basic Usage

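from transformers import pipeline

# Model ID on the Hugging Face Hub
pretrained_name = "w11wo/indo-roberta-small"

# Build a fill-mask pipeline with the model and its matching tokenizer
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Predict candidates for the <mask> token
fill_mask("Budi sedang <mask> di sekolah.")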
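The fill-mask pipeline returns a list of dictionaries, one per candidate, with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the top candidates. The helper name `top_tokens` is hypothetical, and the sample entries are placeholders, not real model output.

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # RoBERTa-style tokens may carry a leading space, so strip it for display
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in ranked[:k]]


# Placeholder values shaped like pipeline output; NOT real predictions
sample = [
    {"score": 0.10, "token": 1, "token_str": " contoh_a", "sequence": "..."},
    {"score": 0.70, "token": 2, "token_str": " contoh_b", "sequence": "..."},
    {"score": 0.05, "token": 3, "token_str": " contoh_c", "sequence": "..."},
]
print(top_tokens(sample, k=2))
```

In practice, the result of `fill_mask("Budi sedang <mask> di sekolah.")` would be passed in place of `sample`.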

Advanced Usage

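from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/indo-roberta-small"
# Load the encoder (without the language-modeling head) and its tokenizer
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

prompt = "Budi sedang berada di sekolah."
# Tokenize into PyTorch tensors and run a forward pass
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state holds one feature vector per token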
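The model output contains one vector per token. A common way to turn these into a single sentence-level feature vector is mean pooling over the non-padding tokens. The sketch below is one possible approach, not a method prescribed by the model card. The helpers `mean_pool` and `embed` are hypothetical. The pooling is written in plain Python for clarity; tensor operations would be the usual choice for batches.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(prompt, pretrained_name="w11wo/indo-roberta-small"):
    """Return one feature vector for a sentence (downloads the model)."""
    import torch
    from transformers import RobertaModel, RobertaTokenizerFast

    model = RobertaModel.from_pretrained(pretrained_name)
    tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)
    encoded = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded)
    vectors = output.last_hidden_state[0].tolist()  # first (only) sequence
    mask = encoded["attention_mask"][0].tolist()
    return mean_pool(vectors, mask)
```

Calling `embed("Budi sedang berada di sekolah.")` returns a list of floats whose length equals the model's hidden size. The list can be fed to a downstream classifier or compared with other sentence vectors.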

📄 License

This project is licensed under the MIT license.

⚠️ Important Note

Keep in mind that although the training data came from Wikipedia, the model may not always produce factual text. Biases present in the Wikipedia articles may also carry over into the model's outputs.

Author

Indo RoBERTa Small was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using its free GPU access.
