Indonesian BERT base model (uncased)
A BERT-base model pre-trained on Indonesian Wikipedia, suitable for various downstream NLP tasks.
Quick Start
You can use this model directly with a pipeline for masked language modeling. Here are some code examples to get you started.
Usage Examples
Basic Usage
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-522M')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
'score': 0.7983310222625732,
'token': 1495},
{'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
'score': 0.090003103017807,
'token': 17},
{'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
'score': 0.025469014421105385,
'token': 1600},
{'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
'score': 0.017966199666261673,
'token': 1555},
{'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
'score': 0.016971781849861145,
'token': 1572}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
model_name='cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
And in TensorFlow:
from transformers import BertTokenizer, TFBertModel
model_name='cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
Features
This is a BERT-base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. It is uncased, meaning it does not distinguish between uppercase and lowercase words (e.g. indonesia and Indonesia are treated identically). It is one of several language models pre-trained on Indonesian datasets.
More details about its usage on downstream tasks (text classification, text generation, etc.) are available at Transformer based Indonesian Language Models.
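Because the model is uncased, its tokenizer is expected to lowercase input before tokenization. The following is only a minimal sketch to illustrate this, assuming the published tokenizer config enables lowercasing; the example sentence is arbitrary:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
# If lowercasing is enabled in the tokenizer config, mixed-case input
# is lowercased before WordPiece tokenization.
print(tokenizer.tokenize("Ibu ku sedang Bekerja di Supermarket"))
# Expected output (assumption): all-lowercase tokens such as
# ['ibu', 'ku', 'sedang', 'bekerja', 'di', 'supermarket']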
Technical Details
Training Data
This model was pre-trained on 522 MB of Indonesian Wikipedia text. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
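As an illustration, the tokenizer inserts these special tokens automatically when it is given a pair of sentences. A minimal sketch (the example sentences are arbitrary):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
# Encoding a sentence pair yields: [CLS] sentence a [SEP] sentence b [SEP]
encoded = tokenizer("Ibu sedang bekerja.", "Dia bekerja di supermarket.")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The tokenizer's vocabulary should match the 32,000 WordPiece entries mentioned above.
print(tokenizer.vocab_size)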
License
This project is licensed under the MIT license.
Documentation
Intended uses & limitations
This model can be used for various downstream NLP tasks such as text classification and text generation. The usage details can be found at Transformer based Indonesian Language Models.
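For example, the pre-trained encoder can be loaded with a task-specific classification head and then fine-tuned on labelled data. This is only a minimal sketch; the number of labels (num_labels=2 here) and the example sentence are placeholders, not part of the published model:
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
# Loads the pre-trained encoder and adds a randomly initialised
# classification head; the head must still be fine-tuned on labelled data.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
inputs = tokenizer("Filmnya sangat bagus.", return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # one score per label, here a (1, 2) tensor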
How to use
As shown in the usage examples above, you can use this model directly for masked language modeling through a pipeline, and you can also extract text features in PyTorch or TensorFlow.
Information Table
| Property | Details |
|----------|---------|
| Model Type | BERT-base, uncased, pre-trained on Indonesian Wikipedia with a masked language modeling (MLM) objective |
| Training Data | 522 MB of Indonesian Wikipedia, lowercased and tokenized with WordPiece (vocabulary size: 32,000) |