Indonesian DistilBERT base model (uncased)
This is a distilled version of the Indonesian BERT base model. It is uncased and pre-trained on Indonesian datasets. It can be used for various downstream tasks such as text classification and text generation.
Quick Start
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/distilbert-base-indonesian')
>>> unmasker("Ayahku sedang bekerja di sawah untuk [MASK] padi")
[
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk menanam padi [SEP]",
"score": 0.6853187084197998,
"token": 12712,
"token_str": "menanam"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk bertani padi [SEP]",
"score": 0.03739545866847038,
"token": 15484,
"token_str": "bertani"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk memetik padi [SEP]",
"score": 0.02742469497025013,
"token": 30338,
"token_str": "memetik"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk penggilingan padi [SEP]",
"score": 0.02214187942445278,
"token": 28252,
"token_str": "penggilingan"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk tanam padi [SEP]",
"score": 0.0185895636677742,
"token": 11308,
"token_str": "tanam"
}
]
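The pipeline returns its candidates sorted by score, so the most likely completion can be read off directly. As a small illustrative helper (operating on a trimmed copy of the list shown above, not calling the model):

```python
# Trimmed version of the pipeline output shown above
predictions = [
    {"score": 0.6853, "token_str": "menanam"},
    {"score": 0.0374, "token_str": "bertani"},
    {"score": 0.0274, "token_str": "memetik"},
]

def best_token(results):
    # max() by score also works if the list is not pre-sorted
    return max(results, key=lambda r: r["score"])["token_str"]

print(best_token(predictions))  # menanam
```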
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import DistilBertTokenizer, DistilBertModel
model_name='cahya/distilbert-base-indonesian'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import DistilBertTokenizer, TFDistilBertModel
model_name='cahya/distilbert-base-indonesian'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = TFDistilBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
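The `output` above contains one hidden-state vector per token. A common way to reduce this to a single sentence vector is masked mean pooling: average the vectors of the non-padding tokens. A minimal sketch of that arithmetic, using plain Python lists in place of tensors (with PyTorch you would apply the same operation to `output.last_hidden_state` and `encoded_input['attention_mask']`):

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden vectors of non-padding tokens.

    hidden_states: list of per-token vectors (lists of floats)
    attention_mask: list of 0/1 flags, 1 marking real tokens
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# Two real tokens and one padding position
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```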
Features
This model is a distilled version of the Indonesian BERT base model. It is uncased and pre-trained on Indonesian datasets. More details about its usage on downstream tasks (text classification, text generation, etc.) are available at Transformer based Indonesian Language Models.
Installation
The examples in this document require the Hugging Face transformers library (pip install transformers).
Usage Examples
Basic Usage
The basic usage examples are shown above in the "How to use" section, where the model is used for masked language modeling and for extracting text features in PyTorch and TensorFlow.
Documentation
The model is a distilled, uncased version of the Indonesian BERT base model, pre-trained on Indonesian datasets. More details about its usage on downstream tasks are available at Transformer based Indonesian Language Models.
Technical Details
This model was distilled with 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
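As an illustration of that input layout (a plain-string sketch only; the real tokenizer inserts these special tokens as ids and applies WordPiece subword splitting):

```python
def format_inputs(sentence_a, sentence_b=None):
    # Mimic the layout above; the model is uncased, so text is lowercased
    parts = ["[CLS]", sentence_a.lower(), "[SEP]"]
    if sentence_b is not None:
        parts += [sentence_b.lower(), "[SEP]"]
    return " ".join(parts)

print(format_inputs("Ayahku sedang bekerja di sawah"))
# [CLS] ayahku sedang bekerja di sawah [SEP]
```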
License
This model is released under the MIT license.
Information Table
Property | Details
Model Type | Distilled version of the Indonesian BERT base model (uncased)
Training Data | 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers