# Indonesian BERT base model (uncased)
A pre-trained BERT-base model for the Indonesian language, useful for various NLP tasks.
## Quick Start
This is a pre-trained BERT-base model for the Indonesian language. It can be used directly for masked language modeling or for extracting text features.
## Features
- Pre-trained on Indonesian Wikipedia and Indonesian newspapers.
- Uncased model, suitable for various downstream NLP tasks.
## Installation
No model-specific installation is needed; the usage examples below only require the Hugging Face `transformers` library (`pip install transformers`).
## Usage Examples
### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-1.5G')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
  'score': 0.7983310222625732,
  'token': 1495},
 {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
  'score': 0.090003103017807,
  'token': 17},
 {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
  'score': 0.025469014421105385,
  'token': 1600},
 {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
  'score': 0.017966199666261673,
  'token': 1555},
 {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
  'score': 0.016971781849861145,
  'token': 1572}]
```
### Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

text = "Silakan diganti dengan text apa saja."  # "Feel free to replace this with any text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)

text = "Silakan diganti dengan text apa saja."  # "Feel free to replace this with any text."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
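In both cases, `output` holds hidden states for every token. A common next step is to reduce them to a single sentence vector; here is a minimal PyTorch sketch (the two pooling choices are illustrative assumptions, not something prescribed by this model card):

```python
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

encoded_input = tokenizer("Silakan diganti dengan text apa saja.", return_tensors='pt')
output = model(**encoded_input)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768).
last_hidden = output.last_hidden_state

# Option 1: use the [CLS] token's hidden state as the sentence embedding.
cls_embedding = last_hidden[:, 0, :]

# Option 2 (assumption: mean pooling is a common alternative):
# average the token embeddings, masking out padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
mean_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape)   # torch.Size([1, 768])
print(mean_embedding.shape)  # torch.Size([1, 768])
```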
## Documentation
This is a BERT-base model pre-trained on Indonesian Wikipedia and Indonesian newspapers using a masked language modeling (MLM) objective. The model is uncased.
It is one of several language models pre-trained on Indonesian datasets. More detail about its usage on downstream tasks (text classification, text generation, etc.) is available at [Transformer based Indonesian Language Models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers); a small example of one such task follows.
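As an illustration of downstream use, the checkpoint can seed a text classifier. A minimal sketch, assuming a binary task (`num_labels=2` and the example sentence are illustrative assumptions, not part of this model card):

```python
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)

# num_labels=2 assumes a binary (e.g. sentiment-style) task; the classification
# head is freshly initialized and still needs fine-tuning on labeled data.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Filmnya sangat bagus.", return_tensors='pt')  # "The movie is very good."
logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
```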
## Technical Details
This model was pre-trained on 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers.
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
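To see this in practice, the tokenizer can encode a sentence pair directly; a small sketch (the example sentences are illustrative, and the exact sub-word split depends on the learned WordPiece vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-1.5G')

# Encoding a sentence pair inserts [CLS] and [SEP] automatically;
# the uncased tokenizer also lowercases the input first.
ids = tokenizer.encode("Ibu pergi ke pasar.",      # "Mother goes to the market."
                       "Ayah bekerja di kantor.")  # "Father works at the office."
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'ibu', 'pergi', 'ke', 'pasar', '.', '[SEP]', 'ayah', ..., '[SEP]']

print(tokenizer.vocab_size)  # 32000, matching the vocabulary size above
```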
## License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | BERT-base, uncased, pre-trained on Indonesian datasets |
| Training Data | Indonesian Wikipedia, Indonesian newspapers (id_newspapers_2018) |