Indonesian BERT base model (uncased)
A BERT-base model pre-trained on Indonesian Wikipedia, suitable for various downstream NLP tasks.
Quick Start
You can use this model directly with a pipeline for masked language modeling. Here are some code examples to get you started.
Usage Examples
Basic Usage
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-522M')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
'score': 0.7983310222625732,
'token': 1495},
{'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
'score': 0.090003103017807,
'token': 17},
{'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
'score': 0.025469014421105385,
'token': 1600},
{'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
'score': 0.017966199666261673,
'token': 1555},
{'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
'score': 0.016971781849861145,
'token': 1572}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
model_name='cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
And in TensorFlow:
from transformers import BertTokenizer, TFBertModel
model_name='cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)
text = "Silakan diganti dengan text apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
Features
This is a BERT-base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. It is uncased, meaning it does not distinguish between uppercase and lowercase words (e.g. indonesia and Indonesia are treated identically). It is one of several language models pre-trained on Indonesian datasets.
More details about its usage on downstream tasks (text classification, text generation, etc.) are available at Transformer based Indonesian Language Models.
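Because the model is uncased, its tokenizer is expected to lowercase input before tokenization. The following is only a minimal sketch to illustrate this, assuming the published tokenizer config enables lowercasing; the example sentence is arbitrary:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
# If lowercasing is enabled in the tokenizer config, mixed-case input
# is lowercased before WordPiece tokenization.
print(tokenizer.tokenize("Ibu ku sedang Bekerja di Supermarket"))
# Expected output (assumption): all-lowercase tokens such as
# ['ibu', 'ku', 'sedang', 'bekerja', 'di', 'supermarket']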
Technical Details
Training Data
This model was pre-trained on 522 MB of Indonesian Wikipedia text. The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
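As an illustration, the tokenizer inserts these special tokens automatically when it is given a pair of sentences. A minimal sketch (the example sentences are arbitrary):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-522M')
# Encoding a sentence pair yields: [CLS] sentence a [SEP] sentence b [SEP]
encoded = tokenizer("Ibu sedang bekerja.", "Dia bekerja di supermarket.")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The tokenizer's vocabulary should match the 32,000 WordPiece entries mentioned above.
print(tokenizer.vocab_size)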
License
This project is licensed under the MIT license.
Documentation
Intended uses & limitations
This model can be used for various downstream NLP tasks such as text classification and text generation. The usage details can be found at Transformer based Indonesian Language Models.
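For example, the pre-trained encoder can be loaded with a task-specific classification head and then fine-tuned on labelled data. This is only a minimal sketch; the number of labels (num_labels=2 here) and the example sentence are placeholders, not part of the published model:
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
# Loads the pre-trained encoder and adds a randomly initialised
# classification head; the head must still be fine-tuned on labelled data.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
inputs = tokenizer("Filmnya sangat bagus.", return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # one score per label, here a (1, 2) tensor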
How to use
As shown in the usage examples above, you can use this model directly for masked language modeling through a pipeline, and you can also extract text features in PyTorch or TensorFlow.
Information Table
| Property | Details |
|----------|---------|
| Model Type | BERT-base, uncased, pre-trained on Indonesian Wikipedia with a masked language modeling (MLM) objective |
| Training Data | 522 MB of Indonesian Wikipedia, lowercased and tokenized with WordPiece (vocabulary size: 32,000) |