# Indonesian BERT base model (uncased)
A pre-trained BERT-base model for the Indonesian language, useful for various NLP tasks.
## Quick Start
This is a pre-trained BERT-base model for the Indonesian language. It can be used directly for masked language modeling or for extracting text features.
## Features
- Pre-trained on Indonesian Wikipedia and Indonesian newspapers.
- Uncased model, suitable for various downstream NLP tasks.
## Installation
No model-specific installation is needed; the usage examples below only require the Hugging Face `transformers` library (`pip install transformers`).
## Usage Examples
### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cahya/bert-base-indonesian-1.5G')
>>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
  'score': 0.7983310222625732,
  'token': 1495},
 {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
  'score': 0.090003103017807,
  'token': 17},
 {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
  'score': 0.025469014421105385,
  'token': 1600},
 {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
  'score': 0.017966199666261673,
  'token': 1555},
 {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
  'score': 0.016971781849861145,
  'token': 1572}]
```
### Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

text = "Silakan diganti dengan text apa saja."  # "Feel free to replace this with any text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertModel.from_pretrained(model_name)

text = "Silakan diganti dengan text apa saja."  # "Feel free to replace this with any text."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
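In both cases, `output` holds hidden states for every token. A common next step is to reduce them to a single sentence vector; here is a minimal PyTorch sketch (the two pooling choices are illustrative assumptions, not something prescribed by this model card):

```python
from transformers import BertTokenizer, BertModel

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

encoded_input = tokenizer("Silakan diganti dengan text apa saja.", return_tensors='pt')
output = model(**encoded_input)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768).
last_hidden = output.last_hidden_state

# Option 1: use the [CLS] token's hidden state as the sentence embedding.
cls_embedding = last_hidden[:, 0, :]

# Option 2 (assumption: mean pooling is a common alternative):
# average the token embeddings, masking out padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
mean_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape)   # torch.Size([1, 768])
print(mean_embedding.shape)  # torch.Size([1, 768])
```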
## Documentation
This is a BERT-base model pre-trained on Indonesian Wikipedia and Indonesian newspapers using a masked language modeling (MLM) objective. The model is uncased.
It is one of several language models pre-trained on Indonesian datasets. More detail about its usage on downstream tasks (text classification, text generation, etc.) is available at [Transformer based Indonesian Language Models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers); a small example of one such task follows.
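As an illustration of downstream use, the checkpoint can seed a text classifier. A minimal sketch, assuming a binary task (`num_labels=2` and the example sentence are illustrative assumptions, not part of this model card):

```python
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'cahya/bert-base-indonesian-1.5G'
tokenizer = BertTokenizer.from_pretrained(model_name)

# num_labels=2 assumes a binary (e.g. sentiment-style) task; the classification
# head is freshly initialized and still needs fine-tuning on labeled data.
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Filmnya sangat bagus.", return_tensors='pt')  # "The movie is very good."
logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
```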
## Technical Details
This model was pre-trained on 522MB of Indonesian Wikipedia and 1GB of Indonesian newspapers.
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 32,000. The inputs of the model are then of the form:
```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
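To see this in practice, the tokenizer can encode a sentence pair directly; a small sketch (the example sentences are illustrative, and the exact sub-word split depends on the learned WordPiece vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('cahya/bert-base-indonesian-1.5G')

# Encoding a sentence pair inserts [CLS] and [SEP] automatically;
# the uncased tokenizer also lowercases the input first.
ids = tokenizer.encode("Ibu pergi ke pasar.",      # "Mother goes to the market."
                       "Ayah bekerja di kantor.")  # "Father works at the office."
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'ibu', 'pergi', 'ke', 'pasar', '.', '[SEP]', 'ayah', ..., '[SEP]']

print(tokenizer.vocab_size)  # 32000, matching the vocabulary size above
```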
## License
This project is licensed under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | BERT-base, uncased, pre-trained on Indonesian datasets |
| Training Data | Indonesian Wikipedia, Indonesian newspapers (id_newspapers_2018) |