Indonesian DistilBERT base model (uncased)
This is a distilled version of the Indonesian BERT base model. It is uncased and pre-trained on Indonesian datasets. It can be used for various downstream tasks such as text classification and text generation.
Quick Start
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/distilbert-base-indonesian')
>>> unmasker("Ayahku sedang bekerja di sawah untuk [MASK] padi")
[
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk menanam padi [SEP]",
"score": 0.6853187084197998,
"token": 12712,
"token_str": "menanam"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk bertani padi [SEP]",
"score": 0.03739545866847038,
"token": 15484,
"token_str": "bertani"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk memetik padi [SEP]",
"score": 0.02742469497025013,
"token": 30338,
"token_str": "memetik"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk penggilingan padi [SEP]",
"score": 0.02214187942445278,
"token": 28252,
"token_str": "penggilingan"
},
{
"sequence": "[CLS] ayahku sedang bekerja di sawah untuk tanam padi [SEP]",
"score": 0.0185895636677742,
"token": 11308,
"token_str": "tanam"
}
]
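The pipeline returns its candidates sorted by score, so the most likely completion can be read off directly. As a small illustrative helper (operating on a trimmed copy of the list shown above, not calling the model):

```python
# Trimmed version of the pipeline output shown above
predictions = [
    {"score": 0.6853, "token_str": "menanam"},
    {"score": 0.0374, "token_str": "bertani"},
    {"score": 0.0274, "token_str": "memetik"},
]

def best_token(results):
    # max() by score also works if the list is not pre-sorted
    return max(results, key=lambda r: r["score"])["token_str"]

print(best_token(predictions))  # menanam
```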
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import DistilBertTokenizer, DistilBertModel
model_name='cahya/distilbert-base-indonesian'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import DistilBertTokenizer, TFDistilBertModel
model_name='cahya/distilbert-base-indonesian'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = TFDistilBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
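The `output` above contains one hidden-state vector per token. A common way to reduce this to a single sentence vector is masked mean pooling: average the vectors of the non-padding tokens. A minimal sketch of that arithmetic, using plain Python lists in place of tensors (with PyTorch you would apply the same operation to `output.last_hidden_state` and `encoded_input['attention_mask']`):

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden vectors of non-padding tokens.

    hidden_states: list of per-token vectors (lists of floats)
    attention_mask: list of 0/1 flags, 1 marking real tokens
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# Two real tokens and one padding position
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```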
Features
This model is a distilled version of the Indonesian BERT base model. It is uncased and pre-trained on Indonesian datasets. More details about its usage on downstream tasks (text classification, text generation, etc.) are available at Transformer based Indonesian Language Models.
Installation
The examples in this document require the Hugging Face transformers library (pip install transformers).
Usage Examples
Basic Usage
The basic usage examples are shown above in the "How to use" section, where the model is used for masked language modeling and for extracting text features in PyTorch and TensorFlow.
Documentation
The model is a distilled, uncased version of the Indonesian BERT base model, pre-trained on Indonesian datasets. More details about its usage on downstream tasks are available at Transformer based Indonesian Language Models.
Technical Details
This model was distilled with 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
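As an illustration of that input layout (a plain-string sketch only; the real tokenizer inserts these special tokens as ids and applies WordPiece subword splitting):

```python
def format_inputs(sentence_a, sentence_b=None):
    # Mimic the layout above; the model is uncased, so text is lowercased
    parts = ["[CLS]", sentence_a.lower(), "[SEP]"]
    if sentence_b is not None:
        parts += [sentence_b.lower(), "[SEP]"]
    return " ".join(parts)

print(format_inputs("Ayahku sedang bekerja di sawah"))
# [CLS] ayahku sedang bekerja di sawah [SEP]
```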
License
This model is released under the MIT license.
Information Table
Property | Details
Model Type | Distilled version of the Indonesian BERT base model (uncased)
Training Data | 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers