Roberta Base Indonesian 522M
An Indonesian pretrained language model based on the RoBERTa-base architecture, trained on Indonesian Wikipedia data; it is case insensitive.
Downloads: 454
Release date: 3/2/2022
Model Overview
This is a model based on the RoBERTa-base architecture, pretrained on Indonesian Wikipedia data using the Masked Language Modeling (MLM) objective. The model is case insensitive and suitable for Indonesian text processing tasks.
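A minimal sketch of loading the model and tokenizer for masked language modeling with the Hugging Face Transformers library. The repository id cahya/roberta-base-indonesian-522M is an assumption inferred from the model name; substitute the actual id if it differs.

```python
# Minimal loading sketch; the repository id below is assumed, not confirmed.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "cahya/roberta-base-indonesian-522M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```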
Model Features
Case insensitive
The model does not distinguish between upper and lower case; for example, 'indonesia' and 'Indonesia' are treated identically.
Based on RoBERTa architecture
Adopts the RoBERTa-base architecture, which refines the original BERT pretraining method.
Indonesian-specific
Pretrained specifically on Indonesian text, making it well suited to Indonesian text processing tasks.
Model Capabilities
Masked language modeling
Text feature extraction
Indonesian text processing
Use Cases
Text processing
Mask prediction
Predict masked words in text
Can accurately predict missing words in Indonesian text (see the sketch below)
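A hedged sketch of mask prediction using the Transformers fill-mask pipeline, assuming the same repository id as above:

```python
# Fill-mask sketch; the model id is an assumption based on the model name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cahya/roberta-base-indonesian-522M")

# Use the tokenizer's own mask token ("<mask>" for RoBERTa-style tokenizers).
text = f"ibu kota indonesia adalah {fill_mask.tokenizer.mask_token}."
for pred in fill_mask(text):
    print(pred["token_str"], round(pred["score"], 4))
```

Since the model is case insensitive, lowercase input works as well as mixed case.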
Text feature extraction
Obtain vector representations of text
Can be used as feature input for downstream NLP tasks (see the sketch below)
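A sketch of extracting text features by mean-pooling the encoder's last hidden states, again assuming the repository id above; mean pooling is one common choice here, not a method prescribed by the model card.

```python
# Feature-extraction sketch; model id assumed, mean pooling chosen for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "cahya/roberta-base-indonesian-522M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("saya suka membaca buku", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average token embeddings into a single sentence vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for a base-size model
```

The resulting vector can feed a downstream classifier or a similarity search index.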